1
Monroe CL, Abdelhafez YG, Atsina K, Aman E, Nardo L, Madani MH. Evaluation of responses to cardiac imaging questions by the artificial intelligence large language model ChatGPT. Clin Imaging 2024; 112:110193. [PMID: 38820977] [DOI: 10.1016/j.clinimag.2024.110193]
Abstract
PURPOSE To assess ChatGPT's ability as a resource for educating patients on various aspects of cardiac imaging, including diagnosis, imaging modalities, indications, interpretation of radiology reports, and management. METHODS 30 questions were posed to ChatGPT-3.5 and ChatGPT-4 three times each in three separate chat sessions. Responses were scored as correct, incorrect, or clinically misleading by three observers: two board-certified cardiologists and one board-certified radiologist with cardiac imaging subspecialization. Consistency of responses across the three sessions was also evaluated. Final categorization was based on a majority vote of at least two of the three observers. RESULTS ChatGPT-3.5 answered 17 of 28 questions correctly (61%) by majority vote, and ChatGPT-4 answered 21 of 28 correctly (75%). A majority vote for correctness was not achieved for two questions. ChatGPT-3.5 answered 26 of 30 questions consistently (87%), and ChatGPT-4 answered 29 of 30 consistently (97%). ChatGPT-3.5 gave both consistent and correct responses to 17 of 28 questions (61%); ChatGPT-4 did so for 20 of 28 questions (71%). CONCLUSION ChatGPT-4 performed better overall than ChatGPT-3.5 in answering cardiac imaging questions with regard to correctness and consistency of responses. While both ChatGPT-3.5 and ChatGPT-4 answered over half of the cardiac imaging questions correctly, inaccurate, clinically misleading, and inconsistent responses suggest the need for further refinement before their application to educating patients about cardiac imaging.
Affiliation(s)
- Cynthia L Monroe
- College of Medicine, California Northstate University, 9700 W Taron Dr, Elk Grove, CA 95757, USA
- Yasser G Abdelhafez
- Department of Radiology, University of California, Davis Medical Center, 4860 Y St, Suite 3100, Sacramento, CA 95817, USA
- Kwame Atsina
- Division of Cardiovascular Medicine, University of California, Davis Medical Center, 4860 Y St, Suite 0200, Sacramento, CA 95817, USA
- Edris Aman
- Division of Cardiovascular Medicine, University of California, Davis Medical Center, 4860 Y St, Suite 0200, Sacramento, CA 95817, USA
- Lorenzo Nardo
- Department of Radiology, University of California, Davis Medical Center, 4860 Y St, Suite 3100, Sacramento, CA 95817, USA
- Mohammad H Madani
- Department of Radiology, University of California, Davis Medical Center, 4860 Y St, Suite 3100, Sacramento, CA 95817, USA.
2
Kooraki S, Hosseiny M, Jalili MH, Rahsepar AA, Imanzadeh A, Kim GH, Hassani C, Abtin F, Moriarty JM, Bedayat A. Evaluation of ChatGPT-Generated Educational Patient Pamphlets for Common Interventional Radiology Procedures. Acad Radiol 2024:S1076-6332(24)00307-6. [PMID: 38839458] [DOI: 10.1016/j.acra.2024.05.024]
Abstract
RATIONALE AND OBJECTIVES This study aimed to evaluate the accuracy and reliability of educational patient pamphlets created by ChatGPT, a large language model, for common interventional radiology (IR) procedures. METHODS AND MATERIALS Twenty frequently performed IR procedures were selected, and five users independently asked ChatGPT to generate an educational patient pamphlet for each procedure using identical commands. Two independent radiologists then assessed the content, quality, and accuracy of the pamphlets, focusing on potential errors, inaccuracies, and the consistency of the pamphlets. RESULTS In a thorough analysis of the educational pamphlets, we identified shortcomings in 30% (30/100) of pamphlets, with a total of 34 specific inaccuracies, including missing information about sedation for the procedure (10/34) and inaccuracies related to procedure-specific complications (8/34). A keyword co-occurrence network showed consistent themes within each group of pamphlets, while a line-by-line comparison at the level of users and across different procedures showed statistically significant inconsistencies (P < 0.001). CONCLUSION ChatGPT-generated educational pamphlets demonstrated potential clinical relevance and fairly consistent terminology; however, the pamphlets were not entirely accurate and exhibited some shortcomings and inter-user structural variability. To ensure patient safety, future improvements and refinements in large language models are warranted, along with continued human supervision and expert validation.
Affiliation(s)
- Soheil Kooraki
- Department of Radiological Sciences, David Geffen School of Medicine, University of California, Los Angeles (UCLA), Los Angeles, CA.
- Melina Hosseiny
- Department of Radiology, University of California, San Diego (UCSD), San Diego, CA.
- Mohammad H Jalili
- Department of Radiology and Biomedical Imaging, Yale New Haven Health, Bridgeport Hospital, CT.
- Amir Ali Rahsepar
- Department of Radiology, Feinberg School of Medicine, Northwestern University, Chicago, IL.
- Amir Imanzadeh
- Department of Radiology, University of California, Irvine (UCI), Irvine, CA.
- Grace Hyun Kim
- Department of Radiological Sciences, David Geffen School of Medicine, University of California, Los Angeles (UCLA), Los Angeles, CA.
- Cameron Hassani
- Department of Radiological Sciences, David Geffen School of Medicine, University of California, Los Angeles (UCLA), Los Angeles, CA.
- Fereidoun Abtin
- Department of Radiological Sciences, David Geffen School of Medicine, University of California, Los Angeles (UCLA), Los Angeles, CA.
- John M Moriarty
- Department of Radiological Sciences, Division of Interventional Radiology, David Geffen School of Medicine at UCLA, Los Angeles, CA.
- Arash Bedayat
- Department of Radiological Sciences, David Geffen School of Medicine, University of California, Los Angeles (UCLA), Los Angeles, CA.
3
Daraqel B, Wafaie K, Mohammed H, Cao L, Mheissen S, Liu Y, Zheng L. The performance of artificial intelligence models in generating responses to general orthodontic questions: ChatGPT vs Google Bard. Am J Orthod Dentofacial Orthop 2024; 165:652-662. [PMID: 38493370] [DOI: 10.1016/j.ajodo.2024.01.012]
Abstract
INTRODUCTION This study aimed to evaluate and compare the performance of 2 artificial intelligence (AI) models, Chat Generative Pretrained Transformer-3.5 (ChatGPT-3.5; OpenAI, San Francisco, Calif) and Google Bidirectional Encoder Representations from Transformers (Google Bard; Bard Experiment, Google, Mountain View, Calif), in terms of response accuracy, completeness, generation time, and response length when answering general orthodontic questions. METHODS A team of orthodontic specialists developed a set of 100 questions in 10 orthodontic domains. One author submitted the questions to both ChatGPT and Google Bard. The AI-generated responses from both models were randomly assigned into 2 forms and sent to 5 blinded, independent assessors. The quality of the AI-generated responses was evaluated using a newly developed tool for accuracy of information and completeness. In addition, response generation time and length were recorded. RESULTS The accuracy and completeness of responses were high in both AI models. The median accuracy score was 9 (interquartile range [IQR]: 8-9) for ChatGPT and 8 (IQR: 8-9) for Google Bard (median difference: 1; P < 0.001). The median completeness score was similar in both models, at 8 (IQR: 8-9) for ChatGPT and 8 (IQR: 7-9) for Google Bard. The odds of accuracy and completeness were higher by 31% and 23%, respectively, in ChatGPT than in Google Bard. Google Bard's response generation time was significantly shorter than that of ChatGPT, by 10.4 seconds per question. However, the two models generated responses of similar length. CONCLUSIONS Responses generated by both ChatGPT and Google Bard were rated as highly accurate and complete for the posed general orthodontic questions. However, acquiring answers was generally faster with Google Bard.
Affiliation(s)
- Baraa Daraqel
- Department of Orthodontics, Stomatological Hospital of Chongqing Medical University Chongqing Key Laboratory of Oral Disease and Biomedical Sciences Chongqing Municipal Key Laboratory of Oral Biomedical Engineering of Higher Education, Chongqing, China; Oral Health Research and Promotion Unit, Al-Quds University, Jerusalem, Palestine.
- Khaled Wafaie
- Department of Orthodontics, Faculty of Dentistry, First Affiliated Hospital of Zhengzhou University, Zhengzhou, Henan, China
- Li Cao
- Department of Orthodontics, Stomatological Hospital of Chongqing Medical University Chongqing Key Laboratory of Oral Disease and Biomedical Sciences Chongqing Municipal Key Laboratory of Oral Biomedical Engineering of Higher Education, Chongqing, China
- Yang Liu
- Department of Orthodontics, Stomatological Hospital of Chongqing Medical University Chongqing Key Laboratory of Oral Disease and Biomedical Sciences Chongqing Municipal Key Laboratory of Oral Biomedical Engineering of Higher Education, Chongqing, China
- Leilei Zheng
- Department of Orthodontics, Stomatological Hospital of Chongqing Medical University Chongqing Key Laboratory of Oral Disease and Biomedical Sciences Chongqing Municipal Key Laboratory of Oral Biomedical Engineering of Higher Education, Chongqing, China.
4
Moll M, Heilemann G, Georg D, Kauer-Dorner D, Kuess P. The role of artificial intelligence in informed patient consent for radiotherapy treatments-a case report. Strahlenther Onkol 2024; 200:544-548. [PMID: 38180493] [DOI: 10.1007/s00066-023-02190-7]
Abstract
Recent advancements in large language models (LLMs; e.g., ChatGPT (OpenAI, San Francisco, California, USA)) have led to their widespread use in various fields, including healthcare. This case study reports on the first use of an LLM in a pretreatment discussion and in obtaining informed consent for a radiation oncology treatment; the reproducibility of the replies by ChatGPT 3.5 was also analyzed. A breast cancer patient, following legal consultation, engaged in a conversation with ChatGPT 3.5 regarding her radiotherapy treatment. The patient posed questions about side effects, prevention, activities, medications, and late effects. While some answers contained inaccuracies, the responses closely resembled doctors' replies. In a final evaluation discussion, however, the patient stated that she preferred the presence of a physician and expressed concerns about the source of the provided information. Reproducibility was tested over ten iterations. Future guidelines for using such models in radiation oncology should be driven by medical professionals. While artificial intelligence (AI) can support essential tasks, human interaction remains crucial.
Affiliation(s)
- M Moll
- Department of Radiation Oncology, Comprehensive Cancer Center Vienna, Medical University Vienna, Vienna, Austria.
- G Heilemann
- Department of Radiation Oncology, Comprehensive Cancer Center Vienna, Medical University Vienna, Vienna, Austria
- Dietmar Georg
- Department of Radiation Oncology, Comprehensive Cancer Center Vienna, Medical University Vienna, Vienna, Austria
- D Kauer-Dorner
- Department of Radiation Oncology, Comprehensive Cancer Center Vienna, Medical University Vienna, Vienna, Austria
- P Kuess
- Department of Radiation Oncology, Comprehensive Cancer Center Vienna, Medical University Vienna, Vienna, Austria
5
Patil NS, Huang R, Mihalache A, Kisilevsky E, Kwok J, Popovic MM, Nassrallah G, Chan C, Mallipatna A, Kertes PJ, Muni RH. The Ability of Artificial Intelligence Chatbots ChatGPT and Google Bard to Accurately Convey Preoperative Information for Patients Undergoing Ophthalmic Surgeries. Retina 2024; 44:950-953. [PMID: 38215455] [DOI: 10.1097/iae.0000000000004044]
Abstract
INTRODUCTION To determine whether the two popular artificial intelligence chatbots, ChatGPT and Bard, can provide high-quality information concerning procedure description, risks, benefits, and alternatives of various ophthalmic surgeries. METHODS ChatGPT and Bard were prompted with questions pertaining to the description, potential risks, benefits, alternatives, and implications of not proceeding with various surgeries in different subspecialties of ophthalmology. Six common ophthalmic procedures were included in the authors' analysis. Two comprehensive ophthalmologists and one subspecialist graded each response independently using a 5-point Likert scale. RESULTS Likert grading for accuracy was significantly higher for ChatGPT in comparison with Bard (4.5 ± 0.6 vs. 3.8 ± 0.8, P < 0.0001). Generally, ChatGPT performed better than Bard even when questions were stratified by the type of ophthalmic surgery. There was no significant difference between ChatGPT and Bard for response length (2,104.7 ± 271.4 characters vs. 2,441.0 ± 633.9 characters, P = 0.12). ChatGPT responded significantly slower than Bard (46.0 ± 3.0 vs. 6.6 ± 1.2 seconds, P < 0.0001). CONCLUSION Both ChatGPT and Bard may offer accessible and high-quality information relevant to the informed consent process for various ophthalmic procedures. Nonetheless, both artificial intelligence chatbots overlooked the probability of adverse events, hence limiting their potential and introducing patients to information that may be difficult to interpret.
Affiliation(s)
- Nikhil S Patil
- Michael G. DeGroote School of Medicine, McMaster University, Hamilton, Ontario, Canada
- Ryan Huang
- Temerty Faculty of Medicine, University of Toronto, Toronto, Ontario, Canada
- Andrew Mihalache
- Temerty Faculty of Medicine, University of Toronto, Toronto, Ontario, Canada
- Eli Kisilevsky
- Department of Ophthalmology and Vision Sciences, University of Toronto, Toronto, Ontario, Canada
- Unity Health, St. Joseph's Health Centre, University of Toronto, Toronto, Ontario, Canada
- Jason Kwok
- Department of Ophthalmology and Vision Sciences, University of Toronto, Toronto, Ontario, Canada
- Marko M Popovic
- Department of Ophthalmology and Vision Sciences, University of Toronto, Toronto, Ontario, Canada
- Georges Nassrallah
- Department of Ophthalmology and Vision Sciences, University of Toronto, Toronto, Ontario, Canada
- Department of Ophthalmology, Hospital for Sick Children, University of Toronto, Toronto, Ontario, Canada
- Clara Chan
- Department of Ophthalmology and Vision Sciences, University of Toronto, Toronto, Ontario, Canada
- Ashwin Mallipatna
- Department of Ophthalmology and Vision Sciences, University of Toronto, Toronto, Ontario, Canada
- Department of Ophthalmology, Hospital for Sick Children, University of Toronto, Toronto, Ontario, Canada
- Peter J Kertes
- Department of Ophthalmology and Vision Sciences, University of Toronto, Toronto, Ontario, Canada
- John and Liz Tory Eye Centre, Sunnybrook Health Sciences Centre, Toronto, Ontario, Canada
- Rajeev H Muni
- Department of Ophthalmology and Vision Sciences, University of Toronto, Toronto, Ontario, Canada
- Department of Ophthalmology, St. Michael's Hospital/Unity Health Toronto, Toronto, Ontario, Canada
6
Protocol for the development of the Chatbot Assessment Reporting Tool (CHART) for clinical advice. BMJ Open 2024; 14:e081155. [PMID: 38772889] [PMCID: PMC11110548] [DOI: 10.1136/bmjopen-2023-081155]
Abstract
INTRODUCTION Large language model (LLM)-linked chatbots are being increasingly applied in healthcare due to their impressive functionality and public availability. Studies have assessed the ability of LLM-linked chatbots to provide accurate clinical advice. However, the methods applied in these Chatbot Assessment Studies are inconsistent due to the lack of reporting standards available, which obscures the interpretation of their study findings. This protocol outlines the development of the Chatbot Assessment Reporting Tool (CHART) reporting guideline. METHODS AND ANALYSIS The development of the CHART reporting guideline will consist of three phases, led by the Steering Committee. During phase one, the team will identify relevant reporting guidelines with artificial intelligence extensions that are published or in development by searching preprint servers, protocol databases, and the Enhancing the Quality and Transparency of health research Network. During phase two, we will conduct a scoping review to identify studies that have addressed the performance of LLM-linked chatbots in summarising evidence and providing clinical advice. The Steering Committee will identify methodology used in previous Chatbot Assessment Studies. Finally, the study team will use checklist items from prior reporting guidelines and findings from the scoping review to develop a draft reporting checklist. We will then perform a Delphi consensus and host two synchronous consensus meetings with an international, multidisciplinary group of stakeholders to refine reporting checklist items and develop a flow diagram. ETHICS AND DISSEMINATION We will publish the final CHART reporting guideline in peer-reviewed journals and will present findings at peer-reviewed meetings. Ethical approval was submitted to the Hamilton Integrated Research Ethics Board and deemed "not required" in accordance with the Tri-Council Policy Statement (TCPS2) for the development of the CHART reporting guideline (#17025). REGISTRATION This study protocol is preregistered with Open Science Framework: https://doi.org/10.17605/OSF.IO/59E2Q.
7
Pressman SM, Borna S, Gomez-Cabello CA, Haider SA, Forte AJ. AI in Hand Surgery: Assessing Large Language Models in the Classification and Management of Hand Injuries. J Clin Med 2024; 13:2832. [PMID: 38792374] [PMCID: PMC11122623] [DOI: 10.3390/jcm13102832]
Abstract
Background: OpenAI's ChatGPT (San Francisco, CA, USA) and Google's Gemini (Mountain View, CA, USA) are two large language models that show promise for improving and expediting medical decision making in hand surgery. Evaluating the applications of these models within the field of hand surgery is warranted. This study aims to evaluate ChatGPT-4 and Gemini in classifying hand injuries and recommending treatment. Methods: Gemini and ChatGPT were each given 68 fictionalized clinical vignettes of hand injuries twice. The models were asked to use a specific classification system and to recommend surgical or nonsurgical treatment. Classifications were scored for correctness. Results were analyzed using descriptive statistics, a paired two-tailed t-test, and sensitivity testing. Results: Gemini, correctly classifying 70.6% of hand injuries, demonstrated superior classification ability over ChatGPT (mean score 1.46 vs. 0.87, p < 0.001). For management, ChatGPT demonstrated higher sensitivity than Gemini in recommending surgical intervention (98.0% vs. 88.8%) but lower specificity (68.4% vs. 94.7%). Gemini also demonstrated greater response replicability than ChatGPT. Conclusions: Large language models like ChatGPT and Gemini show promise in assisting medical decision making, particularly in hand surgery, with Gemini generally outperforming ChatGPT. These findings emphasize the importance of considering the strengths and limitations of different models when integrating them into clinical practice.
Affiliation(s)
- Sahar Borna
- Division of Plastic Surgery, Mayo Clinic, Jacksonville, FL 32224, USA
- Syed Ali Haider
- Division of Plastic Surgery, Mayo Clinic, Jacksonville, FL 32224, USA
- Antonio Jorge Forte
- Division of Plastic Surgery, Mayo Clinic, Jacksonville, FL 32224, USA
- Center for Digital Health, Mayo Clinic, Rochester, MN 55905, USA
8
Jedrzejczak WW, Skarzynski PH, Raj-Koziak D, Sanfins MD, Hatzopoulos S, Kochanek K. ChatGPT for Tinnitus Information and Support: Response Accuracy and Retest after Three and Six Months. Brain Sci 2024; 14:465. [PMID: 38790444] [PMCID: PMC11118795] [DOI: 10.3390/brainsci14050465]
Abstract
Testing of ChatGPT has recently been performed over a diverse range of topics, but most of these assessments have been based on broad domains of knowledge. Here, we test ChatGPT's knowledge of tinnitus, an important but specialized aspect of audiology and otolaryngology. Testing involved evaluating ChatGPT's answers to a defined set of 10 questions on tinnitus. Because the technology is advancing quickly, we re-evaluated the responses to the same 10 questions 3 and 6 months later. The accuracy of the responses was rated by 6 experts (the authors) using a Likert scale ranging from 1 to 5. Most of ChatGPT's responses were rated as satisfactory or better. However, we did detect a few instances where the responses were not accurate and might be considered somewhat misleading. Over the first 3 months, the ratings generally improved, but there was no further significant improvement at 6 months. In our judgment, ChatGPT provided unexpectedly good responses, given that the questions were quite specific. Although no potentially harmful errors were identified, some mistakes could be seen as somewhat misleading. ChatGPT shows great potential if further developed by experts in specific areas, but for now, it is not yet ready for serious application.
Affiliation(s)
- W. Wiktor Jedrzejczak
- Department of Experimental Audiology, World Hearing Center, Institute of Physiology and Pathology of Hearing, 05-830 Kajetany, Poland;
- Piotr H. Skarzynski
- Department of Teleaudiology and Screening, World Hearing Center, Institute of Physiology and Pathology of Hearing, 05-830 Kajetany, Poland
- Institute of Sensory Organs, 05-830 Kajetany, Poland
- Heart Failure and Cardiac Rehabilitation Department, Faculty of Medicine, Medical University of Warsaw, 03-242 Warsaw, Poland
- Danuta Raj-Koziak
- Tinnitus Department, World Hearing Center, Institute of Physiology and Pathology of Hearing, 05-830 Kajetany, Poland
- Milaine Dominici Sanfins
- Department of Teleaudiology and Screening, World Hearing Center, Institute of Physiology and Pathology of Hearing, 05-830 Kajetany, Poland
- Speech-Hearing-Language Department, Audiology Discipline, Universidade Federal de São Paulo, São Paulo 04023062, Brazil
- Stavros Hatzopoulos
- ENT and Audiology Unit, Department of Neurosciences and Rehabilitation, University of Ferrara, 44121 Ferrara, Italy
- Krzysztof Kochanek
- Department of Experimental Audiology, World Hearing Center, Institute of Physiology and Pathology of Hearing, 05-830 Kajetany, Poland
9
Tu W, Joe BN. The Era of ChatGPT and Large Language Models: Can We Advance Patient-centered Communications Appropriately and Safely? Radiol Imaging Cancer 2024; 6:e240038. [PMID: 38668641] [PMCID: PMC11148828] [DOI: 10.1148/rycan.240038]
Affiliation(s)
- Wendy Tu
- From the Department of Medical Imaging, University of Alberta, 116th St & 85th Ave, Edmonton, AB, Canada T6G 2R3; and Department of Radiology and Biomedical Imaging, University of California at San Francisco, San Francisco, Calif
- Bonnie N. Joe
- From the Department of Medical Imaging, University of Alberta, 116th St & 85th Ave, Edmonton, AB, Canada T6G 2R3; and Department of Radiology and Biomedical Imaging, University of California at San Francisco, San Francisco, Calif
10
Cesur T, Güneş YC. Optimizing Diagnostic Performance of ChatGPT: The Impact of Prompt Engineering on Thoracic Radiology Cases. Cureus 2024; 16:e60009. [PMID: 38854352] [PMCID: PMC11162509] [DOI: 10.7759/cureus.60009]
Abstract
Background Recent studies have highlighted the diagnostic performance of ChatGPT 3.5 and GPT-4 in a text-based format, demonstrating their radiological knowledge across different areas. Our objective is to investigate the impact of prompt engineering on the diagnostic performance of ChatGPT 3.5 and GPT-4 in diagnosing thoracic radiology cases, highlighting how the complexity of prompts influences model performance. Methodology We conducted a retrospective cross-sectional study using 124 publicly available Case of the Month examples from the Thoracic Society of Radiology website. We initially input the cases into the ChatGPT versions without prompting. Then, we employed five different prompts, ranging from basic task-oriented to complex role-specific formulations to measure the diagnostic accuracy of ChatGPT versions. The differential diagnosis lists generated by the models were compared against the radiological diagnoses listed on the Thoracic Society of Radiology website, with a scoring system in place to comprehensively assess the accuracy. Diagnostic accuracy and differential diagnosis scores were analyzed using the McNemar, Chi-square, Kruskal-Wallis, and Mann-Whitney U tests. Results Without any prompts, ChatGPT 3.5's accuracy was 25% (31/124), which increased to 56.5% (70/124) with the most complex prompt (P < 0.001). GPT-4 showed a high baseline accuracy at 53.2% (66/124) without prompting. This accuracy increased to 59.7% (74/124) with complex prompts (P = 0.09). Notably, there was no statistical difference in peak performance between ChatGPT 3.5 (70/124) and GPT-4 (74/124) (P = 0.55). Conclusions This study emphasizes the critical influence of prompt engineering on enhancing the diagnostic performance of ChatGPT versions, especially ChatGPT 3.5.
Affiliation(s)
- Turay Cesur
- Radiology, Ankara Mamak State Hospital, Ankara, TUR
11
Niko MM, Karbasi Z, Kazemi M, Zahmatkeshan M. Comparing ChatGPT and Bing, in response to the Home Blood Pressure Monitoring (HBPM) knowledge checklist. Hypertens Res 2024; 47:1401-1409. [PMID: 38438722] [DOI: 10.1038/s41440-024-01624-8]
Abstract
High blood pressure is a major public health problem worldwide. Given the rapid increase in the number of users of artificial intelligence tools such as ChatGPT and Bing, patients are expected to use these tools as a source of information about high blood pressure. The purpose of this study was to assess the accuracy, completeness, and reproducibility of answers provided by ChatGPT and Bing to a home blood pressure monitoring (HBPM) knowledge checklist. ChatGPT's and Bing's responses to the 10-question HBPM knowledge checklist on blood pressure measurement were independently reviewed by three cardiologists. The mean accuracy and completeness scores of ChatGPT were 5.96 (SD = 0.17) and 2.93 (SD = 0.25), respectively, indicating that the responses were highly accurate overall, with the vast majority receiving the top score; the corresponding scores for Bing were 5.31 (SD = 0.67) and 2.13 (SD = 0.53). As artificial intelligence applications expand, patients can use new tools such as ChatGPT and Bing to search for health information. We found that the answers obtained from ChatGPT are reliable and valuable for patients; Bing is also a powerful tool, but it has more limitations than ChatGPT, and its answers should be interpreted with caution.
Affiliation(s)
- Zahra Karbasi
- Department of Health Information Sciences, Faculty of Management and Medical Information Sciences, Kerman University of Medical Sciences, Kerman, Iran
- Maryam Kazemi
- Noncommunicable Diseases Research Center, Fasa University of Medical Sciences, Fasa, Iran
- Maryam Zahmatkeshan
- Noncommunicable Diseases Research Center, Fasa University of Medical Sciences, Fasa, Iran.
- School of Allied Medical Sciences, Fasa University of Medical Sciences, Fasa, Iran.
12
Al-Sharif EM, Penteado RC, Dib El Jalbout N, Topilow NJ, Shoji MK, Kikkawa DO, Liu CY, Korn BS. Evaluating the Accuracy of ChatGPT and Google BARD in Fielding Oculoplastic Patient Queries: A Comparative Study on Artificial versus Human Intelligence. Ophthalmic Plast Reconstr Surg 2024; 40:303-311. [PMID: 38215452] [DOI: 10.1097/iop.0000000000002567]
Abstract
PURPOSE This study evaluates and compares the accuracy of responses from 2 artificial intelligence platforms to patients' oculoplastics-related questions. METHODS Questions directed toward oculoplastic surgeons were collected, rephrased, and input independently into the ChatGPT-3.5 and BARD chatbots, using the prompt: "As an oculoplastic surgeon, how can I respond to my patient's question?" Responses were independently evaluated by 4 experienced oculoplastic specialists as comprehensive, correct but inadequate, mixed correct and incorrect/outdated data, or completely incorrect. Additionally, the empathy level, length, and automated readability index of the responses were assessed. RESULTS A total of 112 patient questions underwent evaluation. The rates of comprehensive, correct but inadequate, mixed, and completely incorrect answers for ChatGPT were 71.4%, 12.9%, 10.5%, and 5.1%, respectively, compared with 53.1%, 18.3%, 18.1%, and 10.5%, respectively, for BARD. ChatGPT showed more empathy (48.9%) than BARD (13.2%). All graders found that ChatGPT outperformed BARD in the question categories of postoperative healing, medical eye conditions, and medications. When questions were categorized by anatomy, ChatGPT excelled at answering lacrimal questions (83.8%), while BARD performed best in the eyelid group (60.4%). ChatGPT's answers were longer and potentially more challenging to comprehend than BARD's. CONCLUSION This study emphasizes the promising role of artificial intelligence-powered chatbots in oculoplastic patient education and support. With continued development, these chatbots may assist physicians and offer patients accurate information, ultimately contributing to improved patient care while alleviating surgeon burnout. However, it is crucial to highlight that although artificial intelligence may be good at answering questions, physician oversight remains essential to ensure the highest standard of care and address complex medical cases.
Affiliation(s)
- Eman M Al-Sharif
- Division of Oculofacial Plastic and Reconstructive Surgery, Viterbi Family Department of Ophthalmology, UC San Diego Shiley Eye Institute, La Jolla, California, U.S.A
- Clinical Sciences Department, College of Medicine, Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia
- Rafaella C Penteado
- Division of Oculofacial Plastic and Reconstructive Surgery, Viterbi Family Department of Ophthalmology, UC San Diego Shiley Eye Institute, La Jolla, California, U.S.A
- Nahia Dib El Jalbout
- Division of Oculofacial Plastic and Reconstructive Surgery, Viterbi Family Department of Ophthalmology, UC San Diego Shiley Eye Institute, La Jolla, California, U.S.A
- Nicole J Topilow
- Division of Oculofacial Plastic and Reconstructive Surgery, Viterbi Family Department of Ophthalmology, UC San Diego Shiley Eye Institute, La Jolla, California, U.S.A
- Marissa K Shoji
- Division of Oculofacial Plastic and Reconstructive Surgery, Viterbi Family Department of Ophthalmology, UC San Diego Shiley Eye Institute, La Jolla, California, U.S.A
- Don O Kikkawa
- Division of Oculofacial Plastic and Reconstructive Surgery, Viterbi Family Department of Ophthalmology, UC San Diego Shiley Eye Institute, La Jolla, California, U.S.A
- Division of Plastic and Reconstructive Surgery, Department of Surgery, UC San Diego School of Medicine, La Jolla, California, U.S.A
- Catherine Y Liu
- Division of Oculofacial Plastic and Reconstructive Surgery, Viterbi Family Department of Ophthalmology, UC San Diego Shiley Eye Institute, La Jolla, California, U.S.A
- Bobby S Korn
- Division of Oculofacial Plastic and Reconstructive Surgery, Viterbi Family Department of Ophthalmology, UC San Diego Shiley Eye Institute, La Jolla, California, U.S.A
- Division of Plastic and Reconstructive Surgery, Department of Surgery, UC San Diego School of Medicine, La Jolla, California, U.S.A
13
Yang J, Ardavanis KS, Slack KE, Fernando ND, Della Valle CJ, Hernandez NM. Chat Generative Pretrained Transformer (ChatGPT) and Bard: Artificial Intelligence Does not yet Provide Clinically Supported Answers for Hip and Knee Osteoarthritis. J Arthroplasty 2024; 39:1184-1190. [PMID: 38237878] [DOI: 10.1016/j.arth.2024.01.029]
Abstract
BACKGROUND Advancements in artificial intelligence (AI) have led to the creation of large language models (LLMs), such as Chat Generative Pretrained Transformer (ChatGPT) and Bard, that analyze online resources to synthesize responses to user queries. Despite their popularity, the accuracy of LLM responses to medical questions remains unknown. This study aimed to compare the responses of ChatGPT and Bard regarding treatments for hip and knee osteoarthritis with the American Academy of Orthopaedic Surgeons (AAOS) Evidence-Based Clinical Practice Guidelines (CPGs) recommendations. METHODS Both ChatGPT (OpenAI) and Bard (Google) were queried regarding 20 treatments (10 for hip and 10 for knee osteoarthritis) from the AAOS CPGs. Responses were classified by 2 reviewers as being in "Concordance," "Discordance," or "No Concordance" with the AAOS CPGs. Cohen's kappa coefficient was used to assess inter-rater reliability, and Chi-squared analyses were used to compare responses between LLMs. RESULTS Overall, ChatGPT and Bard provided responses that were concordant with the AAOS CPGs for 16 (80%) and 12 (60%) treatments, respectively. Notably, ChatGPT and Bard encouraged the use of non-recommended treatments in 30% and 60% of queries, respectively. There were no differences in performance when evaluating by joint or by recommended versus non-recommended treatments. Studies were referenced in 6 (30%) of the Bard responses and in none (0%) of the ChatGPT responses. Of the 6 Bard responses that referenced studies, the cited study could be identified for only 1 (16.7%); of the remaining responses, 2 (33.3%) cited studies in journals that did not exist, 2 (33.3%) cited studies that could not be found with the information given, and 1 (16.7%) provided links to unrelated studies. CONCLUSIONS Both ChatGPT and Bard do not consistently provide responses that align with the AAOS CPGs. Consequently, physicians and patients should temper expectations about the guidance AI platforms can currently provide.
Affiliation(s)
- JaeWon Yang
- Department of Orthopaedic Surgery, University of Washington, Seattle, Washington
- Kyle S Ardavanis
- Department of Orthopaedic Surgery, Madigan Medical Center, Tacoma, Washington
- Katherine E Slack
- Elson S. Floyd College of Medicine, Washington State University, Spokane, Washington
- Navin D Fernando
- Department of Orthopaedic Surgery, University of Washington, Seattle, Washington
- Craig J Della Valle
- Department of Orthopaedic Surgery, Rush University Medical Center, Chicago, Illinois
- Nicholas M Hernandez
- Department of Orthopaedic Surgery, University of Washington, Seattle, Washington
14
Schlussel L, Samaan JS, Chan Y, Chang B, Yeo YH, Ng WH, Rezaie A. Evaluating the accuracy and reproducibility of ChatGPT-4 in answering patient questions related to small intestinal bacterial overgrowth. Artif Intell Gastroenterol 2024; 5:90503. [DOI: 10.35712/aig.v5.i1.90503]
Abstract
BACKGROUND Small intestinal bacterial overgrowth (SIBO) poses diagnostic and treatment challenges due to its complex management and evolving guidelines. Patients often seek online information related to their health, prompting interest in large language models, like GPT-4, as potential sources of patient education.
AIM To investigate ChatGPT-4's accuracy and reproducibility in responding to patient questions related to SIBO.
METHODS A total of 27 patient questions related to SIBO were curated from professional societies, Facebook groups, and Reddit threads. Each question was entered into GPT-4 twice, on separate days, to examine the reproducibility of accuracy across separate occasions. The GPT-4-generated responses were independently evaluated for accuracy and reproducibility by two motility fellowship-trained gastroenterologists, and a third senior fellowship-trained gastroenterologist resolved disagreements. The accuracy of responses was graded using the following scale: (1) Comprehensive; (2) Correct but inadequate; (3) Some correct and some incorrect; or (4) Completely incorrect.
RESULTS In evaluating GPT-4's effectiveness at answering SIBO-related questions, it provided responses containing correct information for 18/27 (66.7%) questions, with 16/27 (59.3%) responses graded as comprehensive and 2/27 (7.4%) graded as correct but inadequate. The model provided responses containing incorrect information for 9/27 (33.3%) questions, with 4/27 (14.8%) responses graded as completely incorrect and 5/27 (18.5%) graded as mixed correct and incorrect data. Accuracy varied by question category: questions related to “basic knowledge” achieved the highest proportion of comprehensive responses (90%) and no incorrect responses, whereas “treatment”-related questions yielded the lowest proportion of comprehensive responses (33.3%) and the highest proportion of completely incorrect responses (33.3%). A total of 77.8% of questions yielded reproducible responses.
CONCLUSION Though GPT-4 shows promise as a supplementary tool for SIBO-related patient education, the model requires further refinement and validation in subsequent iterations prior to its integration into patient care.
Affiliation(s)
- Lauren Schlussel
- Division of Gastroenterology and Hepatology, Cedars-Sinai Medical Center, Los Angeles, CA 90048, United States
- Jamil S Samaan
- Division of Gastroenterology and Hepatology, Cedars-Sinai Medical Center, Los Angeles, CA 90048, United States
- Yin Chan
- Division of Gastroenterology and Hepatology, Cedars-Sinai Medical Center, Los Angeles, CA 90048, United States
- Bianca Chang
- Division of Gastroenterology and Hepatology, Cedars-Sinai Medical Center, Los Angeles, CA 90048, United States
- Yee Hui Yeo
- Division of Gastroenterology and Hepatology, Cedars-Sinai Medical Center, Los Angeles, CA 90048, United States
- Wee Han Ng
- Bristol Medical School, University of Bristol, BS8 1TH, Bristol, United Kingdom
- Ali Rezaie
- Division of Gastroenterology and Hepatology, Cedars-Sinai Medical Center, Los Angeles, CA 90048, United States
- Medically Associated Science and Technology Program, Cedars-Sinai Medical Center, Los Angeles, CA 90048, United States
15
Keshavarz P, Bagherieh S, Nabipoorashrafi SA, Chalian H, Rahsepar AA, Kim GHJ, Hassani C, Raman SS, Bedayat A. ChatGPT in radiology: A systematic review of performance, pitfalls, and future perspectives. Diagn Interv Imaging 2024:S2211-5684(24)00105-0. [PMID: 38679540] [DOI: 10.1016/j.diii.2024.04.003]
Abstract
PURPOSE The purpose of this study was to systematically review the reported performance of ChatGPT, identify potential limitations, and explore future directions for its integration, optimization, and ethical considerations in radiology applications. MATERIALS AND METHODS The PubMed, Web of Science, Embase, and Google Scholar databases were comprehensively searched to identify published studies, up to January 1, 2024, that utilized ChatGPT for clinical radiology applications. RESULTS Of the 861 studies retrieved, 44 evaluated the performance of ChatGPT; among these, 37 (37/44; 84.1%) demonstrated high performance, and seven (7/44; 15.9%) indicated lower performance in providing information for diagnosis and clinical decision support (6/44; 13.6%) or for patient communication and educational content (1/44; 2.3%). Twenty-four (24/44; 54.5%) studies reported ChatGPT's performance as a proportion: 19 (19/24; 79.2%) of these recorded a median accuracy of 70.5%, and five (5/24; 20.8%) reported a median agreement of 83.6% between ChatGPT outcomes and reference standards (radiologists' decisions or guidelines), generally confirming ChatGPT's high accuracy in these studies. Eleven studies compared two recent ChatGPT versions, and in ten (10/11; 90.9%), ChatGPT-4 outperformed ChatGPT-3.5, showing notable enhancements in addressing higher-order thinking questions, better comprehension of radiology terms, and improved accuracy in describing images. Risks and concerns about using ChatGPT included biased responses, limited originality, and the potential for inaccurate information leading to misinformation, hallucinations, improper citations and fake references, cybersecurity vulnerabilities, and patient privacy risks. CONCLUSION Although ChatGPT's effectiveness was demonstrated in 84.1% of radiology studies, multiple pitfalls and limitations remain to be addressed. It is too soon to confirm its complete proficiency and accuracy, and more extensive multicenter studies utilizing diverse datasets and pre-training techniques are required to verify ChatGPT's role in radiology.
Affiliation(s)
- Pedram Keshavarz
- Department of Radiological Sciences, David Geffen School of Medicine, University of California, Los Angeles (UCLA), Los Angeles, CA 90095, USA; School of Science and Technology, The University of Georgia, Tbilisi 0171, Georgia
- Sara Bagherieh
- Independent Clinical Radiology Researcher, Los Angeles, CA 90024, USA
- Hamid Chalian
- Department of Radiology, Cardiothoracic Imaging, University of Washington, Seattle, WA 98195, USA
- Amir Ali Rahsepar
- Department of Radiological Sciences, David Geffen School of Medicine, University of California, Los Angeles (UCLA), Los Angeles, CA 90095, USA
- Grace Hyun J Kim
- Department of Radiological Sciences, David Geffen School of Medicine, University of California, Los Angeles (UCLA), Los Angeles, CA 90095, USA; Department of Radiological Sciences, Center for Computer Vision and Imaging Biomarkers, University of California, Los Angeles (UCLA), Los Angeles, CA 90095, USA
- Cameron Hassani
- Department of Radiological Sciences, David Geffen School of Medicine, University of California, Los Angeles (UCLA), Los Angeles, CA 90095, USA
- Steven S Raman
- Department of Radiological Sciences, David Geffen School of Medicine, University of California, Los Angeles (UCLA), Los Angeles, CA 90095, USA
- Arash Bedayat
- Department of Radiological Sciences, David Geffen School of Medicine, University of California, Los Angeles (UCLA), Los Angeles, CA 90095, USA.
16
Lv X, Zhang X, Li Y, Ding X, Lai H, Shi J. Leveraging Large Language Models for Improved Patient Access and Self-Management: Assessor-Blinded Comparison Between Expert- and AI-Generated Content. J Med Internet Res 2024; 26:e55847. [PMID: 38663010] [PMCID: PMC11082737] [DOI: 10.2196/55847]
Abstract
BACKGROUND While large language models (LLMs) such as ChatGPT and Google Bard have shown significant promise in various fields, their broader impact on enhancing patient health care access and quality, particularly in specialized domains such as oral health, requires comprehensive evaluation. OBJECTIVE This study aims to assess the effectiveness of Google Bard, ChatGPT-3.5, and ChatGPT-4 in offering recommendations for common oral health issues, benchmarked against responses from human dental experts. METHODS This comparative analysis used 40 questions derived from patient surveys on prevalent oral diseases, which were administered in a simulated clinical environment. Responses, obtained from both human experts and LLMs, were subject to a blinded evaluation process by experienced dentists and lay users, focusing on readability, appropriateness, harmlessness, comprehensiveness, intent capture, and helpfulness. Additionally, the stability of artificial intelligence responses was assessed by submitting each question 3 times under consistent conditions. RESULTS Google Bard excelled in readability but lagged in appropriateness when compared to human experts (mean 8.51, SD 0.37 vs mean 9.60, SD 0.33; P=.03). ChatGPT-3.5 and ChatGPT-4, however, performed comparably with human experts in terms of appropriateness (mean 8.96, SD 0.35 and mean 9.34, SD 0.47, respectively), with ChatGPT-4 demonstrating the highest stability and reliability. Furthermore, all 3 LLMs received harmlessness scores comparable to those of human experts, and lay users found minimal differences in helpfulness and intent capture between the artificial intelligence models and human responses. CONCLUSIONS LLMs, particularly ChatGPT-4, show potential in oral health care, providing patient-centric information that can enhance patient education and clinical care. The observed performance variations underscore the need for ongoing refinement and ethical considerations in health care settings. Future research should focus on developing strategies for the safe integration of LLMs in health care settings.
Affiliation(s)
- Xiaolei Lv
- Department of Oral and Maxillofacial Implantology, Shanghai PerioImplant Innovation Center, Shanghai Ninth People's Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China
- College of Stomatology, Shanghai Jiao Tong University, Shanghai, China
- National Center for Stomatology, Shanghai, China
- National Clinical Research Center for Oral Diseases, Shanghai, China
- Shanghai Key Laboratory of Stomatology, Shanghai, China
- Shanghai Research Institute of Stomatology, Shanghai, China
- Xiaomeng Zhang
- Department of Oral and Maxillofacial Implantology, Shanghai PerioImplant Innovation Center, Shanghai Ninth People's Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China
- College of Stomatology, Shanghai Jiao Tong University, Shanghai, China
- National Center for Stomatology, Shanghai, China
- National Clinical Research Center for Oral Diseases, Shanghai, China
- Shanghai Key Laboratory of Stomatology, Shanghai, China
- Shanghai Research Institute of Stomatology, Shanghai, China
- Yuan Li
- Department of Oral and Maxillofacial Implantology, Shanghai PerioImplant Innovation Center, Shanghai Ninth People's Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China
- College of Stomatology, Shanghai Jiao Tong University, Shanghai, China
- National Center for Stomatology, Shanghai, China
- National Clinical Research Center for Oral Diseases, Shanghai, China
- Shanghai Key Laboratory of Stomatology, Shanghai, China
- Shanghai Research Institute of Stomatology, Shanghai, China
- Xinxin Ding
- Department of Oral and Maxillofacial Implantology, Shanghai PerioImplant Innovation Center, Shanghai Ninth People's Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China
- College of Stomatology, Shanghai Jiao Tong University, Shanghai, China
- National Center for Stomatology, Shanghai, China
- National Clinical Research Center for Oral Diseases, Shanghai, China
- Shanghai Key Laboratory of Stomatology, Shanghai, China
- Shanghai Research Institute of Stomatology, Shanghai, China
- Hongchang Lai
- Department of Oral and Maxillofacial Implantology, Shanghai PerioImplant Innovation Center, Shanghai Ninth People's Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China
- College of Stomatology, Shanghai Jiao Tong University, Shanghai, China
- National Center for Stomatology, Shanghai, China
- National Clinical Research Center for Oral Diseases, Shanghai, China
- Shanghai Key Laboratory of Stomatology, Shanghai, China
- Shanghai Research Institute of Stomatology, Shanghai, China
- Junyu Shi
- Department of Oral and Maxillofacial Implantology, Shanghai PerioImplant Innovation Center, Shanghai Ninth People's Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China
- College of Stomatology, Shanghai Jiao Tong University, Shanghai, China
- National Center for Stomatology, Shanghai, China
- National Clinical Research Center for Oral Diseases, Shanghai, China
- Shanghai Key Laboratory of Stomatology, Shanghai, China
- Shanghai Research Institute of Stomatology, Shanghai, China
17
Bhayana R, Biswas S, Cook TS, Kim W, Kitamura FC, Gichoya J, Yi PH. From Bench to Bedside With Large Language Models: AJR Expert Panel Narrative Review. AJR Am J Roentgenol 2024. [PMID: 38598354] [DOI: 10.2214/ajr.24.30928]
Abstract
Large language models (LLMs) hold immense potential to revolutionize radiology. However, their integration into practice requires careful consideration. Artificial intelligence (AI) chatbots and general-purpose LLMs have potential pitfalls related to privacy, transparency, and accuracy, limiting their current clinical readiness. Thus, LLM-based tools must be optimized for radiology practice to overcome these limitations. While research and validation for radiology applications remain in their infancy, commercial products incorporating LLMs are becoming available alongside promises of transforming practice. To help radiologists navigate this landscape, this AJR Expert Panel Narrative Review provides a multidimensional perspective on LLMs, encompassing considerations from bench (development and optimization) to bedside (use in practice). At present, LLMs are not autonomous entities that can replace expert decision-making, and radiologists remain responsible for the content of their reports. Patient-facing tools, particularly medical AI chatbots, require additional guardrails to ensure safety and prevent misuse. Still, if responsibly implemented, LLMs are well-positioned to transform efficiency and quality in radiology. Radiologists must be well-informed and proactively involved in guiding the implementation of LLMs in practice to mitigate risks and maximize benefits to patient care.
Affiliation(s)
- Rajesh Bhayana
- University Medical Imaging Toronto, Joint Department of Medical Imaging, University Health Network, University of Toronto, Toronto, ON, Canada
- Som Biswas
- Department of Radiology, Le Bonheur Children's Hospital, University of Tennessee Health Science Center, Memphis, TN, USA
- Tessa S Cook
- Department of Radiology, Perelman School of Medicine at the University of Pennsylvania, Philadelphia, PA, USA
- Woojin Kim
- Department of Radiology, Palo Alto VA Medical Center, Palo Alto, CA
- Felipe C Kitamura
- Department of Diagnostic Imaging, Universidade Federal de São Paulo, São Paulo, Brazil
- Dasa, São Paulo, Brazil
- Judy Gichoya
- Department of Radiology, Emory University School of Medicine, Georgia, U.S.A
- Paul H Yi
- Department of Diagnostic Radiology and Nuclear Medicine, University of Maryland School of Medicine, Baltimore, MD
18
Alanezi F. Examining the role of ChatGPT in promoting health behaviors and lifestyle changes among cancer patients. Nutr Health 2024:2601060241244563. [PMID: 38567408] [DOI: 10.1177/02601060241244563]
Abstract
Purpose: This study aims to investigate the role of ChatGPT in promoting health behavior changes among cancer patients. Methods: A quasi-experimental design with a qualitative approach was adopted, as ChatGPT is a novel technology and many people are unaware of it. The participants were outpatients at a public hospital. In the experiment, participants used ChatGPT to seek cancer-related information for two weeks, followed by focus group (FG) discussions; a total of 72 outpatients participated in ten focus groups. Results: Three main themes with 14 sub-themes were identified, reflecting the role of ChatGPT in promoting health behavior changes. Its most prominent role was in developing health literacy and promoting self-management of conditions through emotional, informational, and motivational support. Three challenges were identified: privacy, lack of personalization, and reliability issues. Conclusion: Although ChatGPT has great potential for promoting health behavior changes among cancer patients, its usefulness is limited by several factors, such as regulatory, reliability, and privacy issues. Further evidence is needed to generalize the results across regions.
Affiliation(s)
- Fahad Alanezi
- College of Business Administration, Department Management Information Systems, Imam Abdulrahman Bin Faisal University, Dammam, Saudi Arabia
| |
|
19
|
Lehnen NC, Dorn F, Wiest IC, Zimmermann H, Radbruch A, Kather JN, Paech D. Data Extraction from Free-Text Reports on Mechanical Thrombectomy in Acute Ischemic Stroke Using ChatGPT: A Retrospective Analysis. Radiology 2024; 311:e232741. [PMID: 38625006 DOI: 10.1148/radiol.232741] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 04/17/2024]
Abstract
Background Procedural details of mechanical thrombectomy in patients with ischemic stroke are important predictors of clinical outcome and are collected for prospective studies or national stroke registries. To date, these data are collected manually by human readers, a labor-intensive task that is prone to errors. Purpose To evaluate the use of the large language models (LLMs) GPT-4 and GPT-3.5 to extract data from neuroradiology reports on mechanical thrombectomy in patients with ischemic stroke. Materials and Methods This retrospective study included consecutive reports from patients with ischemic stroke who underwent mechanical thrombectomy between November 2022 and September 2023 at institution 1 and between September 2016 and December 2019 at institution 2. A set of 20 reports was used to optimize the prompt, and the ability of the LLMs to extract procedural data from the reports was compared using the McNemar test. Data manually extracted by an interventional neuroradiologist served as the reference standard. Results A total of 100 internal reports from 100 patients (mean age, 74.7 years ± 13.2 [SD]; 53 female) and 30 external reports from 30 patients (mean age, 72.7 years ± 13.5; 18 male) were included. All reports were successfully processed by GPT-4 and GPT-3.5. Of 2800 data entries, 2631 (94.0% [95% CI: 93.0, 94.8]; range per category, 61%-100%) data points were correctly extracted by GPT-4 without the need for further postprocessing. With 1788 of 2800 correct data entries, GPT-3.5 produced fewer correct data entries than did GPT-4 (63.9% [95% CI: 62.0, 65.6]; range per category, 14%-99%; P < .001). For the external reports, GPT-4 extracted 760 of 840 (90.5% [95% CI: 88.3, 92.4]) correct data entries, while GPT-3.5 extracted 539 of 840 (64.2% [95% CI: 60.8, 67.4]; P < .001). Conclusion Compared with GPT-3.5, GPT-4 more frequently extracted correct procedural data from free-text reports on mechanical thrombectomy performed in patients with ischemic stroke. © RSNA, 2024 Supplemental material is available for this article.
Affiliation(s)
- Nils C Lehnen
- From the Department of Neuroradiology, University Hospital Bonn, Rheinische Friedrich-Wilhelms-Universität Bonn, Venusberg-Campus 1, 53127 Bonn, Germany (N.C.L., F.D., A.R., D.P.); Research Group Clinical Neuroimaging, German Center for Neurodegenerative Diseases (DZNE), Bonn, Germany (N.C.L., A.R.); Department of Medicine II, Medical Faculty Mannheim, Heidelberg University, Mannheim, Germany (I.C.W.); Else Kroener Fresenius Center for Digital Health, Medical Faculty Carl Gustav Carus, Technical University Dresden, Dresden, Germany (I.C.W., J.N.K.); Institute of Neuroradiology, University Hospital, LMU Munich, Munich, Germany (H.Z.); and Department of Radiology, Brigham and Women's Hospital, Harvard Medical School, Boston, Mass (D.P.)
| | - Franziska Dorn
- From the Department of Neuroradiology, University Hospital Bonn, Rheinische Friedrich-Wilhelms-Universität Bonn, Venusberg-Campus 1, 53127 Bonn, Germany (N.C.L., F.D., A.R., D.P.); Research Group Clinical Neuroimaging, German Center for Neurodegenerative Diseases (DZNE), Bonn, Germany (N.C.L., A.R.); Department of Medicine II, Medical Faculty Mannheim, Heidelberg University, Mannheim, Germany (I.C.W.); Else Kroener Fresenius Center for Digital Health, Medical Faculty Carl Gustav Carus, Technical University Dresden, Dresden, Germany (I.C.W., J.N.K.); Institute of Neuroradiology, University Hospital, LMU Munich, Munich, Germany (H.Z.); and Department of Radiology, Brigham and Women's Hospital, Harvard Medical School, Boston, Mass (D.P.)
| | - Isabella C Wiest
- From the Department of Neuroradiology, University Hospital Bonn, Rheinische Friedrich-Wilhelms-Universität Bonn, Venusberg-Campus 1, 53127 Bonn, Germany (N.C.L., F.D., A.R., D.P.); Research Group Clinical Neuroimaging, German Center for Neurodegenerative Diseases (DZNE), Bonn, Germany (N.C.L., A.R.); Department of Medicine II, Medical Faculty Mannheim, Heidelberg University, Mannheim, Germany (I.C.W.); Else Kroener Fresenius Center for Digital Health, Medical Faculty Carl Gustav Carus, Technical University Dresden, Dresden, Germany (I.C.W., J.N.K.); Institute of Neuroradiology, University Hospital, LMU Munich, Munich, Germany (H.Z.); and Department of Radiology, Brigham and Women's Hospital, Harvard Medical School, Boston, Mass (D.P.)
| | - Hanna Zimmermann
- From the Department of Neuroradiology, University Hospital Bonn, Rheinische Friedrich-Wilhelms-Universität Bonn, Venusberg-Campus 1, 53127 Bonn, Germany (N.C.L., F.D., A.R., D.P.); Research Group Clinical Neuroimaging, German Center for Neurodegenerative Diseases (DZNE), Bonn, Germany (N.C.L., A.R.); Department of Medicine II, Medical Faculty Mannheim, Heidelberg University, Mannheim, Germany (I.C.W.); Else Kroener Fresenius Center for Digital Health, Medical Faculty Carl Gustav Carus, Technical University Dresden, Dresden, Germany (I.C.W., J.N.K.); Institute of Neuroradiology, University Hospital, LMU Munich, Munich, Germany (H.Z.); and Department of Radiology, Brigham and Women's Hospital, Harvard Medical School, Boston, Mass (D.P.)
| | - Alexander Radbruch
- From the Department of Neuroradiology, University Hospital Bonn, Rheinische Friedrich-Wilhelms-Universität Bonn, Venusberg-Campus 1, 53127 Bonn, Germany (N.C.L., F.D., A.R., D.P.); Research Group Clinical Neuroimaging, German Center for Neurodegenerative Diseases (DZNE), Bonn, Germany (N.C.L., A.R.); Department of Medicine II, Medical Faculty Mannheim, Heidelberg University, Mannheim, Germany (I.C.W.); Else Kroener Fresenius Center for Digital Health, Medical Faculty Carl Gustav Carus, Technical University Dresden, Dresden, Germany (I.C.W., J.N.K.); Institute of Neuroradiology, University Hospital, LMU Munich, Munich, Germany (H.Z.); and Department of Radiology, Brigham and Women's Hospital, Harvard Medical School, Boston, Mass (D.P.)
| | - Jakob Nikolas Kather
- From the Department of Neuroradiology, University Hospital Bonn, Rheinische Friedrich-Wilhelms-Universität Bonn, Venusberg-Campus 1, 53127 Bonn, Germany (N.C.L., F.D., A.R., D.P.); Research Group Clinical Neuroimaging, German Center for Neurodegenerative Diseases (DZNE), Bonn, Germany (N.C.L., A.R.); Department of Medicine II, Medical Faculty Mannheim, Heidelberg University, Mannheim, Germany (I.C.W.); Else Kroener Fresenius Center for Digital Health, Medical Faculty Carl Gustav Carus, Technical University Dresden, Dresden, Germany (I.C.W., J.N.K.); Institute of Neuroradiology, University Hospital, LMU Munich, Munich, Germany (H.Z.); and Department of Radiology, Brigham and Women's Hospital, Harvard Medical School, Boston, Mass (D.P.)
| | - Daniel Paech
- From the Department of Neuroradiology, University Hospital Bonn, Rheinische Friedrich-Wilhelms-Universität Bonn, Venusberg-Campus 1, 53127 Bonn, Germany (N.C.L., F.D., A.R., D.P.); Research Group Clinical Neuroimaging, German Center for Neurodegenerative Diseases (DZNE), Bonn, Germany (N.C.L., A.R.); Department of Medicine II, Medical Faculty Mannheim, Heidelberg University, Mannheim, Germany (I.C.W.); Else Kroener Fresenius Center for Digital Health, Medical Faculty Carl Gustav Carus, Technical University Dresden, Dresden, Germany (I.C.W., J.N.K.); Institute of Neuroradiology, University Hospital, LMU Munich, Munich, Germany (H.Z.); and Department of Radiology, Brigham and Women's Hospital, Harvard Medical School, Boston, Mass (D.P.)
| |
|
20
|
Kim H, Kim P, Joo I, Kim JH, Park CM, Yoon SH. ChatGPT Vision for Radiological Interpretation: An Investigation Using Medical School Radiology Examinations. Korean J Radiol 2024; 25:403-406. [PMID: 38528699 PMCID: PMC10973733 DOI: 10.3348/kjr.2024.0017] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/09/2024] [Revised: 01/11/2024] [Accepted: 01/14/2024] [Indexed: 03/27/2024] Open
Affiliation(s)
- Hyungjin Kim
- Department of Radiology, Seoul National University Hospital, Seoul National University College of Medicine, Seoul, Republic of Korea
| | - Paul Kim
- Graduate School of Education, Stanford University, Stanford, CA, USA
| | - Ijin Joo
- Department of Radiology, Seoul National University Hospital, Seoul National University College of Medicine, Seoul, Republic of Korea
| | - Jung Hoon Kim
- Department of Radiology, Seoul National University Hospital, Seoul National University College of Medicine, Seoul, Republic of Korea
| | - Chang Min Park
- Department of Radiology, Seoul National University Hospital, Seoul National University College of Medicine, Seoul, Republic of Korea
| | - Soon Ho Yoon
- Department of Radiology, Seoul National University Hospital, Seoul National University College of Medicine, Seoul, Republic of Korea.
| |
|
21
|
Cozzi A, Pinker K, Hidber A, Zhang T, Bonomo L, Lo Gullo R, Christianson B, Curti M, Rizzo S, Del Grande F, Mann RM, Schiaffino S, Panzer A. BI-RADS Category Assignments by GPT-3.5, GPT-4, and Google Bard: A Multilanguage Study. Radiology 2024; 311:e232133. [PMID: 38687216 PMCID: PMC11070611 DOI: 10.1148/radiol.232133] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/14/2023] [Revised: 03/08/2024] [Accepted: 03/12/2024] [Indexed: 05/02/2024]
Abstract
Background The performance of publicly available large language models (LLMs) remains unclear for complex clinical tasks. Purpose To evaluate the agreement between human readers and LLMs for Breast Imaging Reporting and Data System (BI-RADS) categories assigned based on breast imaging reports written in three languages and to assess the impact of discordant category assignments on clinical management. Materials and Methods This retrospective study included reports for women who underwent MRI, mammography, and/or US for breast cancer screening or diagnostic purposes at three referral centers. Reports with findings categorized as BI-RADS 1-5 and written in Italian, English, or Dutch were collected between January 2000 and October 2023. Board-certified breast radiologists and the LLMs GPT-3.5 and GPT-4 (OpenAI) and Bard, now called Gemini (Google), assigned BI-RADS categories using only the findings described by the original radiologists. Agreement between human readers and LLMs for BI-RADS categories was assessed using the Gwet agreement coefficient (AC1 value). Frequencies were calculated for changes in BI-RADS category assignments that would affect clinical management (ie, BI-RADS 0 vs BI-RADS 1 or 2 vs BI-RADS 3 vs BI-RADS 4 or 5) and compared using the McNemar test. Results Across 2400 reports, agreement between the original and reviewing radiologists was almost perfect (AC1 = 0.91), while agreement between the original radiologists and GPT-4, GPT-3.5, and Bard was moderate (AC1 = 0.52, 0.48, and 0.42, respectively). Across human readers and LLMs, differences were observed in the frequency of BI-RADS category upgrades or downgrades that would result in changed clinical management (118 of 2400 [4.9%] for human readers, 611 of 2400 [25.5%] for Bard, 573 of 2400 [23.9%] for GPT-3.5, and 435 of 2400 [18.1%] for GPT-4; P < .001) and that would negatively impact clinical management (37 of 2400 [1.5%] for human readers, 435 of 2400 [18.1%] for Bard, 344 of 2400 [14.3%] for GPT-3.5, and 255 of 2400 [10.6%] for GPT-4; P < .001). Conclusion LLMs achieved moderate agreement with human reader-assigned BI-RADS categories across reports written in three languages but also yielded a high percentage of discordant BI-RADS categories that would negatively impact clinical management. © RSNA, 2024 Supplemental material is available for this article.
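For readers unfamiliar with the Gwet agreement coefficient reported above, a minimal Python sketch of AC1 for two raters follows; it is illustrative only and not the study's analysis code.

from collections import Counter

def gwet_ac1(ratings_a, ratings_b):
    # ratings_a / ratings_b: BI-RADS categories assigned by two readers to the same reports
    n = len(ratings_a)
    categories = sorted(set(ratings_a) | set(ratings_b))
    pa = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n  # observed agreement
    counts = Counter(ratings_a) + Counter(ratings_b)
    shares = [counts[c] / (2 * n) for c in categories]  # average share of each category
    pe = sum(s * (1 - s) for s in shares) / (len(categories) - 1)  # chance agreement
    return (pa - pe) / (1 - pe)

# toy example: original radiologist versus an LLM on six hypothetical reports
print(gwet_ac1([1, 2, 4, 3, 5, 2], [1, 2, 3, 3, 5, 4]))

Unlike Cohen kappa, AC1 remains stable when one category dominates, which is one reason it is often preferred for skewed BI-RADS distributions.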
Affiliation(s)
| | | | - Andri Hidber
- From the Imaging Institute of Southern Switzerland (IIMSI), Ente Ospedaliero Cantonale, Via Tesserete 46, 6900 Lugano, Switzerland (A.C., L.B., M.C., S.R., F.D.G., S.S.); Breast Imaging Service, Department of Radiology, Memorial Sloan Kettering Cancer Center, New York, NY (K.P., R.L.G., B.C.); Faculty of Biomedical Sciences, Università della Svizzera Italiana, Lugano, Switzerland (A.H., S.R., F.D.G., S.S.); Department of Radiology, Netherlands Cancer Institute, Amsterdam, the Netherlands (T.Z., R.M.M.); Department of Diagnostic Imaging, Radboud University Medical Center, Nijmegen, the Netherlands (T.Z., R.M.M.); and GROW Research Institute for Oncology and Reproduction, Maastricht University, Maastricht, the Netherlands (T.Z.)
| | - Tianyu Zhang
- From the Imaging Institute of Southern Switzerland (IIMSI), Ente Ospedaliero Cantonale, Via Tesserete 46, 6900 Lugano, Switzerland (A.C., L.B., M.C., S.R., F.D.G., S.S.); Breast Imaging Service, Department of Radiology, Memorial Sloan Kettering Cancer Center, New York, NY (K.P., R.L.G., B.C.); Faculty of Biomedical Sciences, Università della Svizzera Italiana, Lugano, Switzerland (A.H., S.R., F.D.G., S.S.); Department of Radiology, Netherlands Cancer Institute, Amsterdam, the Netherlands (T.Z., R.M.M.); Department of Diagnostic Imaging, Radboud University Medical Center, Nijmegen, the Netherlands (T.Z., R.M.M.); and GROW Research Institute for Oncology and Reproduction, Maastricht University, Maastricht, the Netherlands (T.Z.)
| | - Luca Bonomo
- From the Imaging Institute of Southern Switzerland (IIMSI), Ente Ospedaliero Cantonale, Via Tesserete 46, 6900 Lugano, Switzerland (A.C., L.B., M.C., S.R., F.D.G., S.S.); Breast Imaging Service, Department of Radiology, Memorial Sloan Kettering Cancer Center, New York, NY (K.P., R.L.G., B.C.); Faculty of Biomedical Sciences, Università della Svizzera Italiana, Lugano, Switzerland (A.H., S.R., F.D.G., S.S.); Department of Radiology, Netherlands Cancer Institute, Amsterdam, the Netherlands (T.Z., R.M.M.); Department of Diagnostic Imaging, Radboud University Medical Center, Nijmegen, the Netherlands (T.Z., R.M.M.); and GROW Research Institute for Oncology and Reproduction, Maastricht University, Maastricht, the Netherlands (T.Z.)
| | - Roberto Lo Gullo
- From the Imaging Institute of Southern Switzerland (IIMSI), Ente Ospedaliero Cantonale, Via Tesserete 46, 6900 Lugano, Switzerland (A.C., L.B., M.C., S.R., F.D.G., S.S.); Breast Imaging Service, Department of Radiology, Memorial Sloan Kettering Cancer Center, New York, NY (K.P., R.L.G., B.C.); Faculty of Biomedical Sciences, Università della Svizzera Italiana, Lugano, Switzerland (A.H., S.R., F.D.G., S.S.); Department of Radiology, Netherlands Cancer Institute, Amsterdam, the Netherlands (T.Z., R.M.M.); Department of Diagnostic Imaging, Radboud University Medical Center, Nijmegen, the Netherlands (T.Z., R.M.M.); and GROW Research Institute for Oncology and Reproduction, Maastricht University, Maastricht, the Netherlands (T.Z.)
| | - Blake Christianson
- From the Imaging Institute of Southern Switzerland (IIMSI), Ente Ospedaliero Cantonale, Via Tesserete 46, 6900 Lugano, Switzerland (A.C., L.B., M.C., S.R., F.D.G., S.S.); Breast Imaging Service, Department of Radiology, Memorial Sloan Kettering Cancer Center, New York, NY (K.P., R.L.G., B.C.); Faculty of Biomedical Sciences, Università della Svizzera Italiana, Lugano, Switzerland (A.H., S.R., F.D.G., S.S.); Department of Radiology, Netherlands Cancer Institute, Amsterdam, the Netherlands (T.Z., R.M.M.); Department of Diagnostic Imaging, Radboud University Medical Center, Nijmegen, the Netherlands (T.Z., R.M.M.); and GROW Research Institute for Oncology and Reproduction, Maastricht University, Maastricht, the Netherlands (T.Z.)
| | - Marco Curti
- From the Imaging Institute of Southern Switzerland (IIMSI), Ente Ospedaliero Cantonale, Via Tesserete 46, 6900 Lugano, Switzerland (A.C., L.B., M.C., S.R., F.D.G., S.S.); Breast Imaging Service, Department of Radiology, Memorial Sloan Kettering Cancer Center, New York, NY (K.P., R.L.G., B.C.); Faculty of Biomedical Sciences, Università della Svizzera Italiana, Lugano, Switzerland (A.H., S.R., F.D.G., S.S.); Department of Radiology, Netherlands Cancer Institute, Amsterdam, the Netherlands (T.Z., R.M.M.); Department of Diagnostic Imaging, Radboud University Medical Center, Nijmegen, the Netherlands (T.Z., R.M.M.); and GROW Research Institute for Oncology and Reproduction, Maastricht University, Maastricht, the Netherlands (T.Z.)
| | - Stefania Rizzo
- From the Imaging Institute of Southern Switzerland (IIMSI), Ente Ospedaliero Cantonale, Via Tesserete 46, 6900 Lugano, Switzerland (A.C., L.B., M.C., S.R., F.D.G., S.S.); Breast Imaging Service, Department of Radiology, Memorial Sloan Kettering Cancer Center, New York, NY (K.P., R.L.G., B.C.); Faculty of Biomedical Sciences, Università della Svizzera Italiana, Lugano, Switzerland (A.H., S.R., F.D.G., S.S.); Department of Radiology, Netherlands Cancer Institute, Amsterdam, the Netherlands (T.Z., R.M.M.); Department of Diagnostic Imaging, Radboud University Medical Center, Nijmegen, the Netherlands (T.Z., R.M.M.); and GROW Research Institute for Oncology and Reproduction, Maastricht University, Maastricht, the Netherlands (T.Z.)
| | - Filippo Del Grande
- From the Imaging Institute of Southern Switzerland (IIMSI), Ente Ospedaliero Cantonale, Via Tesserete 46, 6900 Lugano, Switzerland (A.C., L.B., M.C., S.R., F.D.G., S.S.); Breast Imaging Service, Department of Radiology, Memorial Sloan Kettering Cancer Center, New York, NY (K.P., R.L.G., B.C.); Faculty of Biomedical Sciences, Università della Svizzera Italiana, Lugano, Switzerland (A.H., S.R., F.D.G., S.S.); Department of Radiology, Netherlands Cancer Institute, Amsterdam, the Netherlands (T.Z., R.M.M.); Department of Diagnostic Imaging, Radboud University Medical Center, Nijmegen, the Netherlands (T.Z., R.M.M.); and GROW Research Institute for Oncology and Reproduction, Maastricht University, Maastricht, the Netherlands (T.Z.)
| | | | | | - Ariane Panzer
- From the Imaging Institute of Southern Switzerland (IIMSI), Ente Ospedaliero Cantonale, Via Tesserete 46, 6900 Lugano, Switzerland (A.C., L.B., M.C., S.R., F.D.G., S.S.); Breast Imaging Service, Department of Radiology, Memorial Sloan Kettering Cancer Center, New York, NY (K.P., R.L.G., B.C.); Faculty of Biomedical Sciences, Università della Svizzera Italiana, Lugano, Switzerland (A.H., S.R., F.D.G., S.S.); Department of Radiology, Netherlands Cancer Institute, Amsterdam, the Netherlands (T.Z., R.M.M.); Department of Diagnostic Imaging, Radboud University Medical Center, Nijmegen, the Netherlands (T.Z., R.M.M.); and GROW Research Institute for Oncology and Reproduction, Maastricht University, Maastricht, the Netherlands (T.Z.)
| |
|
22
|
Peled T, Sela HY, Weiss A, Grisaru-Granovsky S, Agrawal S, Rottenstreich M. Evaluating the validity of ChatGPT responses on common obstetric issues: Potential clinical applications and implications. Int J Gynaecol Obstet 2024. [PMID: 38523565 DOI: 10.1002/ijgo.15501] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/25/2023] [Revised: 02/29/2024] [Accepted: 03/10/2024] [Indexed: 03/26/2024]
Abstract
OBJECTIVE To evaluate the quality of ChatGPT responses to common issues in obstetrics and assess its ability to provide reliable responses to pregnant individuals. The study aimed to examine the responses based on expert opinions using predetermined criteria, including "accuracy," "completeness," and "safety." METHODS We curated 15 common and potentially clinically significant questions that pregnant women frequently ask. Two native English-speaking women were asked to reframe the questions in their own words, and the ChatGPT language model was used to generate responses to the questions. To evaluate the accuracy, completeness, and safety of the generated responses, we developed a questionnaire with a 1-to-5 scale, and obstetrics and gynecology experts from different countries were invited to rate the responses accordingly. The ratings were analyzed to evaluate the average level of agreement and the percentage of positive ratings (≥4) for each criterion. RESULTS Of the 42 experts invited, 20 responded to the questionnaire. The combined score for all responses yielded a mean rating of 4, with 75% of responses receiving a positive rating (≥4). Among the specific criteria, the ChatGPT responses performed best on accuracy, with a mean rating of 4.2; 80% of the questions received a positive rating. The responses scored lower on completeness, with a mean rating of 3.8; 46.7% of questions received a positive rating. For safety, the mean rating was 3.9, and 53.3% of questions received a positive rating. No response had an average rating below three. CONCLUSION This study demonstrates promising results regarding the potential use of ChatGPT in providing accurate responses to obstetric clinical questions posed by pregnant women. However, caution is warranted when addressing inquiries concerning the safety of the fetus or the mother.
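As a small illustration of the rating summary described above, and not the study's code, the mean rating and share of positive ratings (>=4) per criterion could be tallied as follows; the scores are invented.

import statistics

# Hypothetical ratings (1-5) from several experts for one ChatGPT response
ratings = {
    "accuracy": [5, 4, 4, 3, 5],
    "completeness": [4, 3, 4, 4, 2],
    "safety": [4, 4, 3, 5, 4],
}

for criterion, scores in ratings.items():
    mean_score = statistics.mean(scores)
    positive = 100 * sum(s >= 4 for s in scores) / len(scores)
    print(f"{criterion}: mean {mean_score:.1f}, positive ratings {positive:.0f}%")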
Affiliation(s)
- Tzuria Peled
- Department of Obstetrics and Gynecology, Shaare Zedek Medical Center, Affiliated with the Hebrew University School of Medicine, Jerusalem, Israel
| | - Hen Y Sela
- Department of Obstetrics and Gynecology, Shaare Zedek Medical Center, Affiliated with the Hebrew University School of Medicine, Jerusalem, Israel
| | - Ari Weiss
- Department of Obstetrics and Gynecology, Shaare Zedek Medical Center, Affiliated with the Hebrew University School of Medicine, Jerusalem, Israel
| | - Sorina Grisaru-Granovsky
- Department of Obstetrics and Gynecology, Shaare Zedek Medical Center, Affiliated with the Hebrew University School of Medicine, Jerusalem, Israel
| | - Swati Agrawal
- Division of Maternal-Fetal Medicine, Department of Obstetrics and Gynecology, Hamilton Health Sciences, McMaster University, Hamilton, Ontario, Canada
| | - Misgav Rottenstreich
- Department of Obstetrics and Gynecology, Shaare Zedek Medical Center, Affiliated with the Hebrew University School of Medicine, Jerusalem, Israel
- Division of Maternal-Fetal Medicine, Department of Obstetrics and Gynecology, Hamilton Health Sciences, McMaster University, Hamilton, Ontario, Canada
- Department of Nursing, Jerusalem College of Technology, Jerusalem, Israel
| |
|
23
|
Şenoymak MC, Erbatur NH, Şenoymak İ, Fırat SN. The Role of Artificial Intelligence in Endocrine Management: Assessing ChatGPT's Responses to Prolactinoma Queries. J Pers Med 2024; 14:330. [PMID: 38672957 PMCID: PMC11051052 DOI: 10.3390/jpm14040330] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/21/2024] [Revised: 03/15/2024] [Accepted: 03/21/2024] [Indexed: 04/28/2024] Open
Abstract
This research investigates the utility of Chat Generative Pre-trained Transformer (ChatGPT) in addressing patient inquiries related to hyperprolactinemia and prolactinoma. A set of 46 commonly asked questions from patients with prolactinoma were presented to ChatGPT and responses were evaluated for accuracy with a 6-point Likert scale (1: completely inaccurate to 6: completely accurate) and adequacy with a 5-point Likert scale (1: completely inadequate to 5: completely adequate). Two independent endocrinologists assessed the responses, based on international guidelines. Questions were categorized into groups including general information, diagnostic process, treatment process, follow-up, and pregnancy period. The median accuracy score was 6.0 (IQR, 5.4-6.0), and the adequacy score was 4.5 (IQR, 3.5-5.0). The lowest accuracy and adequacy score assigned by both evaluators was two. Significant agreement was observed between the evaluators, demonstrated by a weighted κ of 0.68 (p = 0.08) for accuracy and a κ of 0.66 (p = 0.04) for adequacy. The Kruskal-Wallis tests revealed statistically significant differences among the groups for accuracy (p = 0.005) and adequacy (p = 0.023). The pregnancy period group had the lowest accuracy score and both pregnancy period and follow-up groups had the lowest adequacy score. In conclusion, ChatGPT demonstrated commendable responses in addressing prolactinoma queries; however, certain limitations were observed, particularly in providing accurate information related to the pregnancy period, emphasizing the need for refining its capabilities in medical contexts.
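A hedged sketch of the two analyses named above, the inter-rater weighted kappa and the Kruskal-Wallis comparison across question groups, is given below with made-up scores; it is not the study's code.

from scipy.stats import kruskal
from sklearn.metrics import cohen_kappa_score

# Hypothetical 6-point accuracy scores from the two endocrinologist evaluators
rater1 = [6, 5, 6, 4, 6, 5, 3, 6]
rater2 = [6, 6, 5, 4, 6, 4, 2, 6]
print("weighted kappa:", round(cohen_kappa_score(rater1, rater2, weights="linear"), 2))

# Hypothetical per-question scores grouped by category
general_information = [6, 6, 5, 6]
treatment_process = [5, 6, 6, 5]
pregnancy_period = [4, 3, 5, 4]
h_stat, p_value = kruskal(general_information, treatment_process, pregnancy_period)
print(f"Kruskal-Wallis H = {h_stat:.2f}, p = {p_value:.3f}")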
Affiliation(s)
- Mustafa Can Şenoymak
- Department of Endocrinology and Metabolism, University of Health Sciences, Sultan Abdulhamid Han Training and Research Hospital, Istanbul 34668, Turkey
| | - Nuriye Hale Erbatur
- Department of Endocrinology and Metabolism, University of Health Sciences, Sultan Abdulhamid Han Training and Research Hospital, Istanbul 34668, Turkey
| | - İrem Şenoymak
- Family Medicine Department, Üsküdar State Hospital, Istanbul 34662, Turkey
| | - Sevde Nur Fırat
- Department of Endocrinology and Metabolism, University of Health Sciences, Ankara Training and Research Hospital, Ankara 06230, Turkey
| |
|
24
|
Chervonski E, Harish KB, Rockman CB, Sadek M, Teter KA, Jacobowitz GR, Berland TL, Lohr J, Moore C, Maldonado TS. Generative artificial intelligence chatbots may provide appropriate informational responses to common vascular surgery questions by patients. Vascular 2024:17085381241240550. [PMID: 38500300 DOI: 10.1177/17085381241240550] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 03/20/2024]
Abstract
OBJECTIVES Generative artificial intelligence (AI) has emerged as a promising tool to engage with patients. The objective of this study was to assess the quality of AI responses to common patient questions regarding vascular surgery disease processes. METHODS OpenAI's ChatGPT-3.5 and Google Bard were queried with 24 mock patient questions spanning seven vascular surgery disease domains. Six experienced vascular surgery faculty at a tertiary academic center independently graded AI responses on their accuracy (rated 1-4 from completely inaccurate to completely accurate), completeness (rated 1-4 from totally incomplete to totally complete), and appropriateness (binary). Responses were also evaluated with three readability scales. RESULTS ChatGPT responses were rated, on average, more accurate than Bard responses (3.08 ± 0.33 vs 2.82 ± 0.40, p < .01). ChatGPT responses were scored, on average, more complete than Bard responses (2.98 ± 0.34 vs 2.62 ± 0.36, p < .01). Most ChatGPT responses (75.0%, n = 18) and almost half of Bard responses (45.8%, n = 11) were unanimously deemed appropriate. Almost one-third of Bard responses (29.2%, n = 7) were deemed inappropriate by at least two reviewers, and two Bard responses (8.4%) were considered inappropriate by the majority. The mean Flesch Reading Ease, Flesch-Kincaid Grade Level, and Gunning Fog Index of ChatGPT responses were 29.4 ± 10.8, 14.5 ± 2.2, and 17.7 ± 3.1, respectively, indicating that responses were readable with a post-secondary education. Bard's mean readability scores were 58.9 ± 10.5, 8.2 ± 1.7, and 11.0 ± 2.0, respectively, indicating that responses were readable with a high-school education (p < .0001 for all three metrics). ChatGPT's mean response length (332 ± 79 words) was longer than Bard's mean response length (183 ± 53 words, p < .001). There was no difference in the accuracy, completeness, readability, or response length of ChatGPT or Bard between disease domains (p > .05 for all analyses). CONCLUSIONS AI offers a novel means of educating patients that avoids the inundation of information from "Dr Google" and the time barriers of physician-patient encounters. ChatGPT provides largely valid, though imperfect, responses to myriad patient questions at the expense of readability. While Bard responses are more readable and concise, their quality is poorer. Further research is warranted to better understand failure points for large language models in vascular surgery patient education.
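The readability scales reported above can be reproduced for any chatbot answer with an off-the-shelf package; the sketch below uses textstat, which is an assumption since the study does not name its software.

import textstat

answer = (
    "An abdominal aortic aneurysm is a bulge in the main artery that carries "
    "blood from your heart to the rest of your body. Small aneurysms are "
    "usually monitored with regular ultrasound scans rather than surgery."
)

print("Flesch Reading Ease:", textstat.flesch_reading_ease(answer))
print("Flesch-Kincaid Grade Level:", textstat.flesch_kincaid_grade(answer))
print("Gunning Fog Index:", textstat.gunning_fog(answer))
print("Response length (words):", len(answer.split()))

Higher Flesch Reading Ease values indicate easier text, whereas the Flesch-Kincaid and Gunning Fog values approximate the school grade level needed to follow the response.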
Affiliation(s)
- Ethan Chervonski
- New York University Grossman School of Medicine, New York, NY, USA
| | - Keerthi B Harish
- New York University Grossman School of Medicine, New York, NY, USA
| | - Caron B Rockman
- Division of Vascular & Endovascular Surgery, Department of Surgery, New York University Langone Health, New York, NY, USA
| | - Mikel Sadek
- Division of Vascular & Endovascular Surgery, Department of Surgery, New York University Langone Health, New York, NY, USA
| | - Katherine A Teter
- Division of Vascular & Endovascular Surgery, Department of Surgery, New York University Langone Health, New York, NY, USA
| | - Glenn R Jacobowitz
- Division of Vascular & Endovascular Surgery, Department of Surgery, New York University Langone Health, New York, NY, USA
| | - Todd L Berland
- Division of Vascular & Endovascular Surgery, Department of Surgery, New York University Langone Health, New York, NY, USA
| | - Joann Lohr
- Dorn Veterans Affairs Medical Center, Columbia, SC, USA
| | | | - Thomas S Maldonado
- Division of Vascular & Endovascular Surgery, Department of Surgery, New York University Langone Health, New York, NY, USA
| |
|
25
|
Carlà MM, Gambini G, Baldascino A, Giannuzzi F, Boselli F, Crincoli E, D'Onofrio NC, Rizzo S. Exploring AI-chatbots' capability to suggest surgical planning in ophthalmology: ChatGPT versus Google Gemini analysis of retinal detachment cases. Br J Ophthalmol 2024:bjo-2023-325143. [PMID: 38448201 DOI: 10.1136/bjo-2023-325143] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/31/2023] [Accepted: 02/16/2024] [Indexed: 03/08/2024]
Abstract
BACKGROUND We aimed to define the capability of three publicly available large language models, Chat Generative Pretrained Transformer (ChatGPT-3.5), ChatGPT-4, and Google Gemini, in analysing retinal detachment cases and suggesting the best possible surgical planning. METHODS Analysis of 54 retinal detachment records entered into the ChatGPT and Gemini interfaces. After asking 'Specify what kind of surgical planning you would suggest and the eventual intraocular tamponade.' and collecting the given answers, we assessed the level of agreement with the common opinion of three expert vitreoretinal surgeons. Moreover, ChatGPT and Gemini answers were graded 1-5 (from poor to excellent quality) according to the Global Quality Score (GQS). RESULTS After excluding 4 controversial cases, 50 cases were included. Overall, the surgical choices of ChatGPT-3.5, ChatGPT-4, and Google Gemini agreed with those of the vitreoretinal surgeons in 40/50 (80%), 42/50 (84%), and 35/50 (70%) of cases, respectively. Google Gemini was not able to respond in five cases. Contingency analysis showed significant differences between ChatGPT-4 and Gemini (p=0.03). ChatGPT's GQS values were 3.9±0.8 and 4.2±0.7 for versions 3.5 and 4, while Gemini scored 3.5±1.1. There was no statistical difference between the two ChatGPT versions (p=0.22), while both outperformed Gemini (p=0.03 and p=0.002, respectively). The main source of error was endotamponade choice (14% for ChatGPT-3.5 and 4, and 12% for Google Gemini). Only ChatGPT-4 was able to suggest a combined phacovitrectomy approach. CONCLUSION In conclusion, Google Gemini and ChatGPT evaluated vitreoretinal patient records in a coherent manner, showing a good level of agreement with expert surgeons. According to the GQS, ChatGPT's recommendations were more accurate and precise.
Affiliation(s)
- Matteo Mario Carlà
- Ophthalmology Department, Catholic University "Sacro Cuore", Rome, Italy
- Ophthalmology Department, Fondazione Policlinico Universitario "A. Gemelli", IRCCS, Rome, Italy
| | - Gloria Gambini
- Ophthalmology Department, Catholic University "Sacro Cuore", Rome, Italy
- Ophthalmology Department, Fondazione Policlinico Universitario "A. Gemelli", IRCCS, Rome, Italy
| | - Antonio Baldascino
- Ophthalmology Department, Catholic University "Sacro Cuore", Rome, Italy
- Ophthalmology Department, Fondazione Policlinico Universitario "A. Gemelli", IRCCS, Rome, Italy
| | - Federico Giannuzzi
- Ophthalmology Department, Catholic University "Sacro Cuore", Rome, Italy
- Ophthalmology Department, Fondazione Policlinico Universitario "A. Gemelli", IRCCS, Rome, Italy
| | - Francesco Boselli
- Ophthalmology Department, Catholic University "Sacro Cuore", Rome, Italy
- Ophthalmology Department, Fondazione Policlinico Universitario "A. Gemelli", IRCCS, Rome, Italy
| | - Emanuele Crincoli
- Ophthalmology Department, Catholic University "Sacro Cuore", Rome, Italy
- Ophthalmology Department, Fondazione Policlinico Universitario "A. Gemelli", IRCCS, Rome, Italy
| | - Nicola Claudio D'Onofrio
- Ophthalmology Department, Catholic University "Sacro Cuore", Rome, Italy
- Ophthalmology Department, Fondazione Policlinico Universitario "A. Gemelli", IRCCS, Rome, Italy
| | - Stanislao Rizzo
- Ophthalmology Department, Catholic University "Sacro Cuore", Rome, Italy
- Ophthalmology Department, Fondazione Policlinico Universitario "A. Gemelli", IRCCS, Rome, Italy
| |
|
26
|
Wu RC, Li DX, Feng DC. Re: Michael Eppler, Conner Ganjavi, Lorenzo Storino Ramacciotti, et al. Awareness and Use of ChatGPT and Large Language Models: A Prospective Cross-sectional Global Survey in Urology. Eur Urol. 2024;85:146-53. Eur Urol 2024; 85:e87-e88. [PMID: 38151444 DOI: 10.1016/j.eururo.2023.11.029] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/12/2023] [Accepted: 11/23/2023] [Indexed: 12/29/2023]
Affiliation(s)
- Rui-Cheng Wu
- Department of Urology, Institute of Urology, West China Hospital, Sichuan University, Chengdu, China
| | - Deng-Xiong Li
- Department of Urology, Institute of Urology, West China Hospital, Sichuan University, Chengdu, China
| | - De-Chao Feng
- Department of Urology, Institute of Urology, West China Hospital, Sichuan University, Chengdu, China.
| |
|
27
|
Wu SH, Tong WJ, Li MD, Hu HT, Lu XZ, Huang ZR, Lin XX, Lu RF, Lu MD, Chen LD, Wang W. Collaborative Enhancement of Consistency and Accuracy in US Diagnosis of Thyroid Nodules Using Large Language Models. Radiology 2024; 310:e232255. [PMID: 38470237 DOI: 10.1148/radiol.232255] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 03/13/2024]
Abstract
Background Large language models (LLMs) hold substantial promise for medical imaging interpretation. However, there is a lack of studies on their feasibility in handling reasoning questions associated with medical diagnosis. Purpose To investigate the viability of leveraging three publicly available LLMs to enhance consistency and diagnostic accuracy in medical imaging based on standardized reporting, with pathology as the reference standard. Materials and Methods US images of thyroid nodules with pathologic results were retrospectively collected from a tertiary referral hospital between July 2022 and December 2022 and used to evaluate malignancy diagnoses generated by three LLMs-OpenAI's ChatGPT 3.5, ChatGPT 4.0, and Google's Bard. Inter- and intra-LLM agreement of diagnosis were evaluated. Then, diagnostic performance, including accuracy, sensitivity, specificity, and area under the receiver operating characteristic curve (AUC), was evaluated and compared for the LLMs and three interactive approaches: human reader combined with LLMs, image-to-text model combined with LLMs, and an end-to-end convolutional neural network model. Results A total of 1161 US images of thyroid nodules (498 benign, 663 malignant) from 725 patients (mean age, 42.2 years ± 14.1 [SD]; 516 women) were evaluated. ChatGPT 4.0 and Bard displayed substantial to almost perfect intra-LLM agreement (κ range, 0.65-0.86 [95% CI: 0.64, 0.86]), while ChatGPT 3.5 showed fair to substantial agreement (κ range, 0.36-0.68 [95% CI: 0.36, 0.68]). ChatGPT 4.0 had an accuracy of 78%-86% (95% CI: 76%, 88%) and sensitivity of 86%-95% (95% CI: 83%, 96%), compared with 74%-86% (95% CI: 71%, 88%) and 74%-91% (95% CI: 71%, 93%), respectively, for Bard. Moreover, with ChatGPT 4.0, the image-to-text-LLM strategy exhibited an AUC (0.83 [95% CI: 0.80, 0.85]) and accuracy (84% [95% CI: 82%, 86%]) comparable to those of the human-LLM interaction strategy with two senior readers and one junior reader and exceeding those of the human-LLM interaction strategy with one junior reader. Conclusion LLMs, particularly integrated with image-to-text approaches, show potential in enhancing diagnostic medical imaging. ChatGPT 4.0 was optimal for consistency and diagnostic accuracy when compared with Bard and ChatGPT 3.5. © RSNA, 2024 Supplemental material is available for this article.
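A minimal sketch of how the diagnostic metrics and the run-to-run (intra-LLM) agreement reported above could be computed is shown below; the labels are invented and this is not the study's pipeline.

from sklearn.metrics import cohen_kappa_score, confusion_matrix

# 1 = malignant, 0 = benign; pathology is the reference standard
pathology = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]   # hypothetical
llm_run_1 = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1]   # hypothetical LLM diagnoses, first run
llm_run_2 = [1, 0, 1, 0, 0, 1, 0, 0, 1, 1]   # same cases, repeated run

tn, fp, fn, tp = confusion_matrix(pathology, llm_run_1).ravel()
print("accuracy:", (tp + tn) / (tp + tn + fp + fn))
print("sensitivity:", tp / (tp + fn))
print("specificity:", tn / (tn + fp))
print("intra-LLM agreement (kappa):", cohen_kappa_score(llm_run_1, llm_run_2))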
Affiliation(s)
- Shao-Hong Wu
- From the Department of Medical Ultrasonics, Ultrasomics Artificial Intelligence X-Laboratory, Institute of Diagnostic and Interventional Ultrasound, First Affiliated Hospital of Sun Yat-sen University, No. 58 Zhongshan Rd 2, Guangzhou 510080, People's Republic of China (S.H.W., W.J.T., M.D. Li, H.T.H., Z.R.H., X.X.L., R.F.L., M.D. Lu, L.D.C., W.W.); and Department of Traditional Chinese Medicine, First Affiliated Hospital of Sun Yat-sen University, Guangzhou, People's Republic of China (X.Z.L.)
| | - Wen-Juan Tong
- From the Department of Medical Ultrasonics, Ultrasomics Artificial Intelligence X-Laboratory, Institute of Diagnostic and Interventional Ultrasound, First Affiliated Hospital of Sun Yat-sen University, No. 58 Zhongshan Rd 2, Guangzhou 510080, People's Republic of China (S.H.W., W.J.T., M.D. Li, H.T.H., Z.R.H., X.X.L., R.F.L., M.D. Lu, L.D.C., W.W.); and Department of Traditional Chinese Medicine, First Affiliated Hospital of Sun Yat-sen University, Guangzhou, People's Republic of China (X.Z.L.)
| | - Ming-De Li
- From the Department of Medical Ultrasonics, Ultrasomics Artificial Intelligence X-Laboratory, Institute of Diagnostic and Interventional Ultrasound, First Affiliated Hospital of Sun Yat-sen University, No. 58 Zhongshan Rd 2, Guangzhou 510080, People's Republic of China (S.H.W., W.J.T., M.D. Li, H.T.H., Z.R.H., X.X.L., R.F.L., M.D. Lu, L.D.C., W.W.); and Department of Traditional Chinese Medicine, First Affiliated Hospital of Sun Yat-sen University, Guangzhou, People's Republic of China (X.Z.L.)
| | - Hang-Tong Hu
- From the Department of Medical Ultrasonics, Ultrasomics Artificial Intelligence X-Laboratory, Institute of Diagnostic and Interventional Ultrasound, First Affiliated Hospital of Sun Yat-sen University, No. 58 Zhongshan Rd 2, Guangzhou 510080, People's Republic of China (S.H.W., W.J.T., M.D. Li, H.T.H., Z.R.H., X.X.L., R.F.L., M.D. Lu, L.D.C., W.W.); and Department of Traditional Chinese Medicine, First Affiliated Hospital of Sun Yat-sen University, Guangzhou, People's Republic of China (X.Z.L.)
| | - Xiao-Zhou Lu
- From the Department of Medical Ultrasonics, Ultrasomics Artificial Intelligence X-Laboratory, Institute of Diagnostic and Interventional Ultrasound, First Affiliated Hospital of Sun Yat-sen University, No. 58 Zhongshan Rd 2, Guangzhou 510080, People's Republic of China (S.H.W., W.J.T., M.D. Li, H.T.H., Z.R.H., X.X.L., R.F.L., M.D. Lu, L.D.C., W.W.); and Department of Traditional Chinese Medicine, First Affiliated Hospital of Sun Yat-sen University, Guangzhou, People's Republic of China (X.Z.L.)
| | - Ze-Rong Huang
- From the Department of Medical Ultrasonics, Ultrasomics Artificial Intelligence X-Laboratory, Institute of Diagnostic and Interventional Ultrasound, First Affiliated Hospital of Sun Yat-sen University, No. 58 Zhongshan Rd 2, Guangzhou 510080, People's Republic of China (S.H.W., W.J.T., M.D. Li, H.T.H., Z.R.H., X.X.L., R.F.L., M.D. Lu, L.D.C., W.W.); and Department of Traditional Chinese Medicine, First Affiliated Hospital of Sun Yat-sen University, Guangzhou, People's Republic of China (X.Z.L.)
| | - Xin-Xin Lin
- From the Department of Medical Ultrasonics, Ultrasomics Artificial Intelligence X-Laboratory, Institute of Diagnostic and Interventional Ultrasound, First Affiliated Hospital of Sun Yat-sen University, No. 58 Zhongshan Rd 2, Guangzhou 510080, People's Republic of China (S.H.W., W.J.T., M.D. Li, H.T.H., Z.R.H., X.X.L., R.F.L., M.D. Lu, L.D.C., W.W.); and Department of Traditional Chinese Medicine, First Affiliated Hospital of Sun Yat-sen University, Guangzhou, People's Republic of China (X.Z.L.)
| | - Rui-Fang Lu
- From the Department of Medical Ultrasonics, Ultrasomics Artificial Intelligence X-Laboratory, Institute of Diagnostic and Interventional Ultrasound, First Affiliated Hospital of Sun Yat-sen University, No. 58 Zhongshan Rd 2, Guangzhou 510080, People's Republic of China (S.H.W., W.J.T., M.D. Li, H.T.H., Z.R.H., X.X.L., R.F.L., M.D. Lu, L.D.C., W.W.); and Department of Traditional Chinese Medicine, First Affiliated Hospital of Sun Yat-sen University, Guangzhou, People's Republic of China (X.Z.L.)
| | - Ming-De Lu
- From the Department of Medical Ultrasonics, Ultrasomics Artificial Intelligence X-Laboratory, Institute of Diagnostic and Interventional Ultrasound, First Affiliated Hospital of Sun Yat-sen University, No. 58 Zhongshan Rd 2, Guangzhou 510080, People's Republic of China (S.H.W., W.J.T., M.D. Li, H.T.H., Z.R.H., X.X.L., R.F.L., M.D. Lu, L.D.C., W.W.); and Department of Traditional Chinese Medicine, First Affiliated Hospital of Sun Yat-sen University, Guangzhou, People's Republic of China (X.Z.L.)
| | - Li-Da Chen
- From the Department of Medical Ultrasonics, Ultrasomics Artificial Intelligence X-Laboratory, Institute of Diagnostic and Interventional Ultrasound, First Affiliated Hospital of Sun Yat-sen University, No. 58 Zhongshan Rd 2, Guangzhou 510080, People's Republic of China (S.H.W., W.J.T., M.D. Li, H.T.H., Z.R.H., X.X.L., R.F.L., M.D. Lu, L.D.C., W.W.); and Department of Traditional Chinese Medicine, First Affiliated Hospital of Sun Yat-sen University, Guangzhou, People's Republic of China (X.Z.L.)
| | - Wei Wang
- From the Department of Medical Ultrasonics, Ultrasomics Artificial Intelligence X-Laboratory, Institute of Diagnostic and Interventional Ultrasound, First Affiliated Hospital of Sun Yat-sen University, No. 58 Zhongshan Rd 2, Guangzhou 510080, People's Republic of China (S.H.W., W.J.T., M.D. Li, H.T.H., Z.R.H., X.X.L., R.F.L., M.D. Lu, L.D.C., W.W.); and Department of Traditional Chinese Medicine, First Affiliated Hospital of Sun Yat-sen University, Guangzhou, People's Republic of China (X.Z.L.)
| |
|
28
|
Coskun BN, Yagiz B, Ocakoglu G, Dalkilic E, Pehlivan Y. Assessing the accuracy and completeness of artificial intelligence language models in providing information on methotrexate use. Rheumatol Int 2024; 44:509-515. [PMID: 37747564 DOI: 10.1007/s00296-023-05473-5] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/21/2023] [Accepted: 09/14/2023] [Indexed: 09/26/2023]
Abstract
We aimed to assess the accuracy and completeness of the large language models (LLMs) ChatGPT-3.5, ChatGPT-4, BARD, and Bing when answering methotrexate (MTX)-related questions for treating rheumatoid arthritis. We employed 23 questions from an earlier study related to MTX concerns. These questions were entered into the LLMs, and the responses generated by each model were evaluated by two reviewers using Likert scales to assess accuracy and completeness. The GPT models achieved a 100% correct answer rate, while BARD and Bing scored 73.91%. In terms of accuracy of the outputs (completely correct responses), GPT-4 achieved a score of 100%, GPT-3.5 secured 86.96%, and BARD and Bing each scored 60.87%. BARD produced 17.39% incorrect responses and 8.7% non-responses, while Bing recorded 13.04% incorrect and 13.04% non-responses. The ChatGPT models produced significantly more accurate responses than Bing for the "mechanism of action" category, and the GPT-4 model showed significantly higher accuracy than BARD in the "side effects" category. There were no statistically significant differences among the models for the "lifestyle" category. GPT-4 achieved a comprehensive output rate of 100%, followed by GPT-3.5 at 86.96%, BARD at 60.86%, and Bing at 0%. In the "mechanism of action" category, both ChatGPT models and BARD produced significantly more comprehensive outputs than Bing. For the "side effects" and "lifestyle" categories, the ChatGPT models showed significantly higher completeness than Bing. The GPT models, particularly GPT-4, demonstrated superior performance in providing accurate and comprehensive patient information about MTX use. However, the study also identified inaccuracies and shortcomings in the generated responses.
Affiliation(s)
- Belkis Nihan Coskun
- Division of Rheumatology, Department of Internal Medicine, Faculty of Medicine, Bursa Uludag University, Bursa, Turkey.
| | - Burcu Yagiz
- Division of Rheumatology, Department of Internal Medicine, Faculty of Medicine, Bursa Uludag University, Bursa, Turkey
| | - Gokhan Ocakoglu
- Department of Biostatistics, Faculty of Medicine, Bursa Uludag University, Bursa, Turkey
| | - Ediz Dalkilic
- Division of Rheumatology, Department of Internal Medicine, Faculty of Medicine, Bursa Uludag University, Bursa, Turkey
| | - Yavuz Pehlivan
- Division of Rheumatology, Department of Internal Medicine, Faculty of Medicine, Bursa Uludag University, Bursa, Turkey
| |
|
29
|
Abi-Rafeh J, Mroueh VJ, Bassiri-Tehrani B, Marks J, Kazan R, Nahai F. Complications Following Body Contouring: Performance Validation of Bard, a Novel AI Large Language Model, in Triaging and Managing Postoperative Patient Concerns. Aesthetic Plast Surg 2024; 48:953-976. [PMID: 38273152 DOI: 10.1007/s00266-023-03819-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/31/2023] [Accepted: 12/14/2023] [Indexed: 01/27/2024]
Abstract
INTRODUCTION Large language models (LLMs) have revolutionized the way humans interact with artificial intelligence (AI) technology, with marked potential for applications in esthetic surgery. The present study evaluates the performance of Bard, a novel LLM, in identifying and managing postoperative patient concerns about complications following body contouring surgery. METHODS The American Society of Plastic Surgeons' website was queried to identify and simulate all potential postoperative complications following body contouring across different acuities and severities. Bard's accuracy was assessed in providing a differential diagnosis, soliciting a history, suggesting the most likely diagnosis, recommending an appropriate disposition and treatments/interventions to begin at home, and identifying red-flag signs/symptoms indicating deterioration or requiring urgent emergency department (ED) presentation. RESULTS Twenty-two simulated body contouring complications were examined. Overall, Bard demonstrated 59% accuracy in listing relevant diagnoses on its differentials, with a 52% incidence of incorrect or misleading diagnoses. Following history-taking, Bard demonstrated an overall accuracy of 44% in identifying the most likely diagnosis and a 55% accuracy in suggesting the indicated medical disposition. Helpful treatments/interventions to begin at home were suggested with 40% accuracy, whereas red-flag signs/symptoms indicating deterioration were identified with 48% accuracy. A detailed analysis of performance, stratified according to latency of postoperative presentation (<48 hours, 48 hours to 1 month, or >1 month postoperatively) and according to acuity and indicated medical disposition, is presented herein. CONCLUSIONS Despite the promising potential of LLMs and AI in healthcare-related applications, Bard's performance in the present study falls significantly short of accepted clinical standards, indicating a need for further research and development prior to adoption. LEVEL OF EVIDENCE IV This journal requires that authors assign a level of evidence to each article. For a full description of these Evidence-Based Medicine ratings, please refer to the Table of Contents or the online Instructions to Authors www.springer.com/00266 .
Affiliation(s)
- Jad Abi-Rafeh
- Division of Plastic, Reconstructive, and Aesthetic Surgery, McGill University Health Centre, Montreal, QC, Canada
| | - Vanessa J Mroueh
- Brigham and Women's Hospital, Harvard Medical School, Boston, Massachusetts, USA
| | | | - Jacob Marks
- Manhattan Eye, Ear, and Throat Hospital, New York, NY, USA
| | - Roy Kazan
- Division of Plastic, Reconstructive, and Aesthetic Surgery, McGill University Health Centre, Montreal, QC, Canada
| | - Foad Nahai
- Department of Surgery, Emory University, Atlanta, GA, USA.
| |
|
30
|
Rahsepar AA. Large Language Models for Enhancing Radiology Report Impressions: Improve Readability While Decreasing Burnout. Radiology 2024; 310:e240498. [PMID: 38530179 DOI: 10.1148/radiol.240498] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 03/27/2024]
Affiliation(s)
- Amir Ali Rahsepar
- From the Department of Radiology, Northwestern Memorial Hospital, 676 N Saint Clair St, Arkes Family Pavilion Suite 800, Chicago, IL 60611
| |
|
31
|
Bera K, O'Connor G, Jiang S, Tirumani SH, Ramaiya N. Analysis of ChatGPT publications in radiology: Literature so far. Curr Probl Diagn Radiol 2024; 53:215-225. [PMID: 37891083 DOI: 10.1067/j.cpradiol.2023.10.013] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/19/2023] [Accepted: 10/18/2023] [Indexed: 10/29/2023]
Abstract
OBJECTIVE To perform a detailed qualitative and quantitative analysis of the published literature on ChatGPT and radiology in the nine months since its public release, detailing the scope of the work in this short timeframe. METHODS A systematic literature search of the MEDLINE and EMBASE databases was carried out through August 15, 2023 for articles focused on ChatGPT and imaging/radiology. Articles were classified into original research and reviews/perspectives. Quantitative analysis was carried out by two experienced radiologists using objective scoring systems for evaluating original and non-original research. RESULTS 51 articles involving ChatGPT and radiology/imaging were published between 26 Jan 2023 and 14 Aug 2023. 23 articles were original research, while the rest were reviews/perspectives or brief communications. For the quantitative analysis, scored by two readers, we included 23 original research and 17 non-original research articles (after excluding 11 letters written in response to previous articles). The mean score for original research was 3.20 out of 5 (across five questions), while the mean score for non-original research was 1.17 out of 2 (across six questions). The mean score grading ChatGPT's performance in original research was 3.20 out of 5 (across two questions). DISCUSSION Although it is early days for ChatGPT and its impact on radiology, a plethora of articles has already discussed the multifaceted nature of the tool and how it can affect every aspect of radiology, from patient education, pre-authorization, protocol selection, and generating differentials to structuring radiology reports. Most articles show impressive performance by ChatGPT, which can only improve with more research and improvements in the tool itself. Several articles have also highlighted the limitations of ChatGPT in its current iteration, which will allow radiologists and researchers to improve these areas.
Affiliation(s)
- Kaustav Bera
- Department of Radiology, University Hospitals Cleveland Medical Center, 11000 Euclid Avenue, Cleveland, OH, 44106, USA.
| | - Gregory O'Connor
- Department of Radiology, University Hospitals Cleveland Medical Center, 11000 Euclid Avenue, Cleveland, OH, 44106, USA
| | - Sirui Jiang
- Department of Radiology, University Hospitals Cleveland Medical Center, 11000 Euclid Avenue, Cleveland, OH, 44106, USA
| | - Sree Harsha Tirumani
- Department of Radiology, University Hospitals Cleveland Medical Center, 11000 Euclid Avenue, Cleveland, OH, 44106, USA
| | - Nikhil Ramaiya
- Department of Radiology, University Hospitals Cleveland Medical Center, 11000 Euclid Avenue, Cleveland, OH, 44106, USA
| |
|
32
|
Doddi S, Hibshman T, Salichs O, Bera K, Tippareddy C, Ramaiya N, Tirumani SH. Assessing appropriate responses to ACR urologic imaging scenarios using ChatGPT and Bard. Curr Probl Diagn Radiol 2024; 53:226-229. [PMID: 37891086 DOI: 10.1067/j.cpradiol.2023.10.022] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/04/2023] [Revised: 10/18/2023] [Accepted: 10/18/2023] [Indexed: 10/29/2023]
Abstract
Artificial intelligence (AI) has recently become a trending tool and topic for productivity, especially with publicly available free services such as ChatGPT and Bard. In this report, we investigate whether two widely available chatbots, ChatGPT and Bard, can consistently recommend the most appropriate imaging modality for urologic clinical scenarios in line with the American College of Radiology (ACR) Appropriateness Criteria (AC). All clinical scenarios provided by the ACR were input into ChatGPT and Bard, and the results were compared with the ACR AC and recorded. Both chatbots recommended the appropriate imaging modality in 62% of scenarios, and no significant difference in the proportion of correct imaging modalities was found overall between the two services (p>0.05). Our study found that ChatGPT and Bard are similar in their ability to suggest the most appropriate imaging modality in a variety of urologic scenarios based on the ACR AC. Nonetheless, both chatbots lack consistent accuracy, and further development is necessary before implementation in clinical settings. For these AI services to be used properly in clinical decision making, further development is needed to support physician workflows.
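A hedged sketch of the overall comparison described above, testing whether the two chatbots' rates of ACR-appropriate recommendations differ, might use a two-proportion z-test as below; the counts are illustrative, and because both chatbots answered the same scenarios, a paired McNemar test would be an equally reasonable choice.

from statsmodels.stats.proportion import proportions_ztest

appropriate = [31, 31]  # hypothetical counts of ACR-appropriate answers (ChatGPT, Bard)
scenarios = [50, 50]    # hypothetical number of ACR scenarios posed to each chatbot

z_stat, p_value = proportions_ztest(appropriate, scenarios)
print(f"z = {z_stat:.2f}, p = {p_value:.3f}")  # p > 0.05 -> no significant difference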
Affiliation(s)
- Sishir Doddi
- University of Toledo College of Medicine, Toledo, OH, United States.
| | - Taryn Hibshman
- University of Toledo College of Medicine, Toledo, OH, United States
| | - Oscar Salichs
- University of Toledo College of Medicine, Toledo, OH, United States
| | - Kaustav Bera
- Department of Radiology, University Hospitals Cleveland Medical Center, Cleveland, OH
| | - Charit Tippareddy
- Department of Radiology, University Hospitals Cleveland Medical Center, Cleveland, OH
| | - Nikhil Ramaiya
- Department of Radiology, University Hospitals Cleveland Medical Center, Cleveland, OH
| | - Sree Harsha Tirumani
- Department of Radiology, University Hospitals Cleveland Medical Center, Cleveland, OH
| |
|
33
|
Hu Y, Hu Z, Liu W, Gao A, Wen S, Liu S, Lin Z. Exploring the potential of ChatGPT as an adjunct for generating diagnosis based on chief complaint and cone beam CT radiologic findings. BMC Med Inform Decis Mak 2024; 24:55. [PMID: 38374067 PMCID: PMC10875853 DOI: 10.1186/s12911-024-02445-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/11/2023] [Accepted: 01/28/2024] [Indexed: 02/21/2024] Open
Abstract
AIM This study aimed to assess the performance of OpenAI's ChatGPT in generating diagnoses based on the chief complaint and cone beam computed tomography (CBCT) radiologic findings. MATERIALS AND METHODS 102 CBCT reports (48 with dental diseases (DD) and 54 with neoplastic/cystic diseases (N/CD)) were collected. ChatGPT was provided with the chief complaint and CBCT radiologic findings. Diagnostic outputs from ChatGPT were scored on a five-point Likert scale. For diagnosis accuracy, scoring was based on the accuracy of the chief-complaint-related diagnosis and chief-complaint-unrelated diagnoses (1-5 points); for diagnosis completeness, scoring was based on how many accurate diagnoses were included in ChatGPT's output for one case (1-5 points); for text quality, scoring was based on how many text errors were included in ChatGPT's output for one case (1-5 points). For the 54 N/CD cases, the consistency of the diagnosis generated by ChatGPT with the pathological diagnosis was also calculated. The composition of text errors in ChatGPT's outputs was evaluated. RESULTS After subjective ratings by expert reviewers on a five-point Likert scale, the final scores for diagnosis accuracy, diagnosis completeness, and text quality of ChatGPT were 3.7, 4.5, and 4.6 for the 102 cases. For diagnostic accuracy, it performed significantly better on N/CD (3.8/5) than on DD (3.6/5). Of the 54 N/CD cases, 21 (38.9%) had a first diagnosis completely consistent with the pathological diagnosis. No text errors were observed in 88.7% of all 390 text items. CONCLUSION ChatGPT showed potential in generating radiographic diagnoses based on the chief complaint and radiologic findings. However, its performance varied with task complexity, necessitating professional oversight due to a certain error rate.
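The Likert-based scoring described above lends itself to a simple tabular aggregation. The sketch below is a minimal, hypothetical illustration of how per-case ratings could be averaged overall and by disease group; the values are invented and do not reproduce the study's 102-case dataset:

```python
import pandas as pd

# Hypothetical per-case five-point Likert ratings mirroring the three scoring
# dimensions used in the study; not the actual CBCT data.
df = pd.DataFrame({
    "group":        ["N/CD", "N/CD", "DD", "DD", "N/CD", "DD"],
    "accuracy":     [4, 5, 3, 4, 4, 3],
    "completeness": [5, 4, 4, 5, 5, 4],
    "text_quality": [5, 5, 4, 5, 5, 4],
})

# Overall mean scores (analogous to the reported 3.7 / 4.5 / 4.6) and the
# N/CD vs. DD accuracy comparison.
print(df[["accuracy", "completeness", "text_quality"]].mean().round(1))
print(df.groupby("group")["accuracy"].mean().round(1))
```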
Collapse
Affiliation(s)
- Yanni Hu
- Department of Dentomaxillofacial Radiology, Nanjing Stomatological Hospital, Affiliated Hospital of Medical School, Institute of Stomatology, Nanjing University, Nanjing, Jiangsu, People's Republic of China
| | - Ziyang Hu
- Department of Dentomaxillofacial Radiology, Nanjing Stomatological Hospital, Affiliated Hospital of Medical School, Institute of Stomatology, Nanjing University, Nanjing, Jiangsu, People's Republic of China
- Department of Stomatology, Shenzhen Longhua District Central Hospital, Shenzhen, People's Republic of China
| | - Wenjing Liu
- Department of Dentomaxillofacial Radiology, Nanjing Stomatological Hospital, Affiliated Hospital of Medical School, Institute of Stomatology, Nanjing University, Nanjing, Jiangsu, People's Republic of China
| | - Antian Gao
- Department of Dentomaxillofacial Radiology, Nanjing Stomatological Hospital, Affiliated Hospital of Medical School, Institute of Stomatology, Nanjing University, Nanjing, Jiangsu, People's Republic of China
| | - Shanhui Wen
- Department of Dentomaxillofacial Radiology, Nanjing Stomatological Hospital, Affiliated Hospital of Medical School, Institute of Stomatology, Nanjing University, Nanjing, Jiangsu, People's Republic of China
| | - Shu Liu
- Department of Dentomaxillofacial Radiology, Nanjing Stomatological Hospital, Affiliated Hospital of Medical School, Institute of Stomatology, Nanjing University, Nanjing, Jiangsu, People's Republic of China
| | - Zitong Lin
- Department of Dentomaxillofacial Radiology, Nanjing Stomatological Hospital, Affiliated Hospital of Medical School, Institute of Stomatology, Nanjing University, Nanjing, Jiangsu, People's Republic of China.
| |
Collapse
|
34
|
Peng W, Feng Y, Yao C, Zhang S, Zhuo H, Qiu T, Zhang Y, Tang J, Gu Y, Sun Y. Evaluating AI in medicine: a comparative analysis of expert and ChatGPT responses to colorectal cancer questions. Sci Rep 2024; 14:2840. [PMID: 38310152 PMCID: PMC10838275 DOI: 10.1038/s41598-024-52853-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/06/2023] [Accepted: 01/24/2024] [Indexed: 02/05/2024] Open
Abstract
Colorectal cancer (CRC) is a global health challenge, and patient education plays a crucial role in its early detection and treatment. Despite progress in AI technology, as exemplified by transformer-based models such as ChatGPT, there remains a lack of in-depth understanding of their efficacy for medical purposes. We aimed to assess the proficiency of ChatGPT in the field of popular science, specifically in answering questions related to CRC diagnosis and treatment, using the book "Colorectal Cancer: Your Questions Answered" as a reference. In total, 131 valid questions from the book were manually input into ChatGPT. Responses were evaluated by clinical physicians in the relevant fields based on comprehensiveness and accuracy of information, and scores were standardized for comparison. Not surprisingly, ChatGPT showed high reproducibility in its responses, with high uniformity in comprehensiveness, accuracy, and final scores. However, the mean scores of ChatGPT's responses were significantly lower than the benchmarks, indicating that it has not reached an expert level of competence in CRC. While it could provide accurate information, it lacked comprehensiveness. Notably, ChatGPT performed well in the domains of radiation therapy, interventional therapy, stoma care, venous care, and pain control, almost rivaling the benchmarks, but fell short in the basic information, surgery, and internal medicine domains. While ChatGPT demonstrated promise in specific domains, its overall proficiency in providing CRC information falls short of expert standards, indicating the need for further advancements and improvements in AI technology for patient education in healthcare.
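The abstract states that standardized ChatGPT scores were compared against expert benchmarks but does not specify the test here. As a hedged illustration only, the sketch below shows one simple way such a comparison could be made, testing hypothetical standardized scores in a single domain against a benchmark value with a one-sample t-test; the numbers are invented and the authors' actual method may differ:

```python
from scipy.stats import ttest_1samp

# Hypothetical standardized scores for ChatGPT responses in one domain, tested
# against an expert benchmark value; not the study's data or necessarily its test.
chatgpt_scores = [72, 68, 75, 70, 66, 74, 69, 71]
benchmark = 80

t, p = ttest_1samp(chatgpt_scores, popmean=benchmark)
print(f"t = {t:.2f}, p = {p:.4f}")  # p < 0.05 -> mean differs from the benchmark
```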
Collapse
Affiliation(s)
- Wen Peng
- Department of General Surgery, The First Affiliated Hospital with Nanjing Medical University, Nanjing, 210029, Jiangsu, People's Republic of China
- The First School of Clinical Medicine, Nanjing Medical University, Nanjing, China
| | - Yifei Feng
- Department of General Surgery, The First Affiliated Hospital with Nanjing Medical University, Nanjing, 210029, Jiangsu, People's Republic of China
- The First School of Clinical Medicine, Nanjing Medical University, Nanjing, China
| | - Cui Yao
- Department of General Surgery, The First Affiliated Hospital with Nanjing Medical University, Nanjing, 210029, Jiangsu, People's Republic of China
- The First School of Clinical Medicine, Nanjing Medical University, Nanjing, China
| | - Sheng Zhang
- Department of Radiotherapy, The First Affiliated Hospital with Nanjing Medical University, Nanjing, People's Republic of China
| | - Han Zhuo
- Department of Intervention, The First Affiliated Hospital with Nanjing Medical University, Nanjing, People's Republic of China
| | - Tianzhu Qiu
- Department of Oncology, The First Affiliated Hospital with Nanjing Medical University, Nanjing, People's Republic of China
| | - Yi Zhang
- Department of General Surgery, The First Affiliated Hospital with Nanjing Medical University, Nanjing, 210029, Jiangsu, People's Republic of China
- The First School of Clinical Medicine, Nanjing Medical University, Nanjing, China
| | - Junwei Tang
- Department of General Surgery, The First Affiliated Hospital with Nanjing Medical University, Nanjing, 210029, Jiangsu, People's Republic of China.
- The First School of Clinical Medicine, Nanjing Medical University, Nanjing, China.
| | - Yanhong Gu
- Department of Oncology, The First Affiliated Hospital with Nanjing Medical University, Nanjing, People's Republic of China.
| | - Yueming Sun
- Department of General Surgery, The First Affiliated Hospital with Nanjing Medical University, Nanjing, 210029, Jiangsu, People's Republic of China.
- The First School of Clinical Medicine, Nanjing Medical University, Nanjing, China.
| |
Collapse
|
35
|
Toyama Y, Harigai A, Abe M, Nagano M, Kawabata M, Seki Y, Takase K. Performance evaluation of ChatGPT, GPT-4, and Bard on the official board examination of the Japan Radiology Society. Jpn J Radiol 2024; 42:201-207. [PMID: 37792149 PMCID: PMC10811006 DOI: 10.1007/s11604-023-01491-2] [Citation(s) in RCA: 7] [Impact Index Per Article: 7.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/25/2023] [Accepted: 09/12/2023] [Indexed: 10/05/2023]
Abstract
PURPOSE Herein, we assessed the accuracy of large language models (LLMs) in generating responses to questions in clinical radiology practice. We compared the performance of ChatGPT, GPT-4, and Google Bard using questions from the Japan Radiology Board Examination (JRBE). MATERIALS AND METHODS In total, 103 questions from the JRBE 2022 were used with permission from the Japan Radiological Society. These questions were categorized by pattern, required level of thinking, and topic. McNemar's test was used to compare the proportion of correct responses between the LLMs. Fisher's exact test was used to assess the performance of GPT-4 for each topic category. RESULTS ChatGPT, GPT-4, and Google Bard correctly answered 40.8% (42 of 103), 65.0% (67 of 103), and 38.8% (40 of 103) of the questions, respectively. GPT-4 significantly outperformed ChatGPT by 24.2% (p < 0.001) and Google Bard by 26.2% (p < 0.001). In the categorical analysis by level of thinking, GPT-4 correctly answered 79.7% of the lower-order questions, which was significantly higher than ChatGPT or Google Bard (p < 0.001). The categorical analysis by question pattern revealed GPT-4's superiority over ChatGPT (67.4% vs. 46.5%, p = 0.004) and Google Bard (39.5%, p < 0.001) in the single-answer questions. The categorical analysis by topic revealed that GPT-4 outperformed ChatGPT (40%, p = 0.013) and Google Bard (26.7%, p = 0.004). No significant differences were observed between the LLMs in the categories not mentioned above. The performance of GPT-4 was significantly better in nuclear medicine (93.3%) than in diagnostic radiology (55.8%; p < 0.001). GPT-4 also performed better on lower-order questions than on higher-order questions (79.7% vs. 45.5%, p < 0.001). CONCLUSION ChatGPT Plus based on GPT-4 scored 65% when answering Japanese questions from the JRBE, outperforming ChatGPT and Google Bard. This highlights the potential of using LLMs to address advanced clinical questions in the field of radiology in Japan.
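McNemar's test, which the abstract names for comparing paired correct/incorrect responses on the same questions, can be computed from the discordant pairs alone. The sketch below is a minimal illustration with invented per-question outcomes (not the JRBE data), using an exact binomial form of the test:

```python
from scipy.stats import binomtest

# Hypothetical per-question outcomes (1 = correct, 0 = incorrect) for two models
# answering the same exam questions; the actual JRBE responses are not reproduced.
gpt4    = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1]
chatgpt = [1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1]

# McNemar's test uses only the discordant pairs (one model right, the other wrong).
b = sum(1 for x, y in zip(gpt4, chatgpt) if x == 1 and y == 0)
c = sum(1 for x, y in zip(gpt4, chatgpt) if x == 0 and y == 1)

# Exact McNemar's test: under H0, the discordant pairs split 50/50.
result = binomtest(b, n=b + c, p=0.5)
print(f"discordant pairs b={b}, c={c}; exact McNemar p = {result.pvalue:.3f}")
```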
Collapse
Affiliation(s)
- Yoshitaka Toyama
- Department of Diagnostic Radiology, Tohoku University Hospital, 1-1 Seiryo-Machi, Aoba-Ku, Sendai, 980-8575, Japan.
| | - Ayaka Harigai
- Department of Diagnostic Radiology, Tohoku University Hospital, 1-1 Seiryo-Machi, Aoba-Ku, Sendai, 980-8575, Japan
- Department of Radiology, Tohoku Medical and Pharmaceutical University, Sendai, Japan
- Department of Diagnostic Radiology, Tohoku University Graduate School of Medicine, Sendai, Japan
| | - Mirei Abe
- Department of Diagnostic Radiology, Tohoku University Hospital, 1-1 Seiryo-Machi, Aoba-Ku, Sendai, 980-8575, Japan
| | | | - Masahiro Kawabata
- Department of Diagnostic Radiology, Tohoku University Hospital, 1-1 Seiryo-Machi, Aoba-Ku, Sendai, 980-8575, Japan
| | - Yasuhiro Seki
- Department of Radiation Oncology, Tohoku University Hospital, Sendai, Japan
| | - Kei Takase
- Department of Diagnostic Radiology, Tohoku University Graduate School of Medicine, Sendai, Japan
| |
Collapse
|
36
|
Kim S, Lee CK, Kim SS. Large Language Models: A Guide for Radiologists. Korean J Radiol 2024; 25:126-133. [PMID: 38288895 PMCID: PMC10831297 DOI: 10.3348/kjr.2023.0997] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/12/2023] [Revised: 11/27/2023] [Accepted: 12/18/2023] [Indexed: 02/01/2024] Open
Abstract
Large language models (LLMs) have revolutionized the global landscape of technology beyond natural language processing. Owing to their extensive pre-training on vast datasets, contemporary LLMs can handle tasks ranging from general functionalities to domain-specific areas, such as radiology, without additional fine-tuning. General-purpose chatbots based on LLMs can optimize the efficiency of radiologists in terms of their professional work and research endeavors. Importantly, these LLMs are on a trajectory of rapid evolution, wherein challenges such as "hallucination," high training cost, and efficiency issues are addressed, along with the inclusion of multimodal inputs. In this review, we aim to offer conceptual knowledge and actionable guidance to radiologists interested in utilizing LLMs through a succinct overview of the topic and a summary of radiology-specific aspects, from the beginning to potential future directions.
Collapse
Affiliation(s)
- Sunkyu Kim
- Department of Computer Science and Engineering, Korea University, Seoul, Republic of Korea
- AIGEN Sciences, Seoul, Republic of Korea
| | - Choong-Kun Lee
- Division of Medical Oncology, Department of Internal Medicine, Yonsei University College of Medicine, Seoul, Republic of Korea
| | - Seung-Seob Kim
- Department of Radiology and Research Institute of Radiological Science, Severance Hospital, Yonsei University College of Medicine, Seoul, Republic of Korea.
| |
Collapse
|
37
|
Patil NS, Huang RS, Caterine S, Yao J, Larocque N, van der Pol CB, Stubbs E. Artificial Intelligence Chatbots' Understanding of the Risks and Benefits of Computed Tomography and Magnetic Resonance Imaging Scenarios. Can Assoc Radiol J 2024:8465371231220561. [PMID: 38183235 DOI: 10.1177/08465371231220561] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/07/2024] Open
Abstract
PURPOSE Patients may seek online information to better understand medical imaging procedures. The purpose of this study was to assess the accuracy of information provided by 2 popular artificial intelligence (AI) chatbots pertaining to the risks, benefits, and alternatives of common imaging scenarios. METHODS Fourteen imaging-related scenarios pertaining to computed tomography (CT) or magnetic resonance imaging (MRI) were used. Factors including the use of intravenous contrast, the presence of renal disease, and whether the patient was pregnant were included in the analysis. For each scenario, 3 prompts for outlining the (1) risks, (2) benefits, and (3) alternative imaging choices or potential implications of not using contrast were inputted into ChatGPT and Bard. A grading rubric and a 5-point Likert scale were used by 2 independent reviewers to grade responses. Prompt variability and chatbot context dependency were also assessed. RESULTS ChatGPT's performance was superior to Bard's in accurately responding to prompts per Likert grading (4.36 ± 0.63 vs 3.25 ± 1.03, P < .0001). There was substantial agreement between independent reviewer grading for ChatGPT (κ = 0.621) and Bard (κ = 0.684). Response text length was not statistically different between ChatGPT and Bard (2087 ± 256 characters vs 2162 ± 369 characters, P = .24). Response time was longer for ChatGPT (34 ± 2 vs 8 ± 1 seconds, P < .0001). CONCLUSIONS ChatGPT outperformed Bard at outlining risks, benefits, and alternatives for common imaging scenarios. Generally, context dependency and prompt variability did not change chatbot response content. Due to the lack of detailed scientific reasoning and inability to provide patient-specific information, both AI chatbots have limitations as a patient information resource.
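Inter-rater agreement of the kind reported above (κ = 0.621 and 0.684) is typically quantified with Cohen's kappa. The sketch below is a small, hypothetical illustration of how two reviewers' Likert grades could be compared; the grades are invented and not taken from the study:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical 5-point Likert grades from two independent reviewers for the same
# set of chatbot responses; the study's actual grades are not reproduced here.
reviewer_1 = [5, 4, 5, 3, 4, 5, 2, 4, 5, 4]
reviewer_2 = [5, 4, 4, 3, 4, 5, 3, 4, 5, 5]

kappa = cohen_kappa_score(reviewer_1, reviewer_2)
print(f"Cohen's kappa = {kappa:.3f}")  # 0.61-0.80 is conventionally 'substantial'
```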
Collapse
Affiliation(s)
- Nikhil S Patil
- Michael G. DeGroote School of Medicine, McMaster University, Hamilton, ON, Canada
| | - Ryan S Huang
- Temerty Faculty of Medicine, University of Toronto, Toronto, ON, Canada
| | - Scott Caterine
- Department of Radiology, McMaster University, Hamilton, ON, Canada
| | - Jason Yao
- Department of Radiology, McMaster University, Hamilton, ON, Canada
| | - Natasha Larocque
- Department of Radiology, McMaster University, Hamilton, ON, Canada
| | | | - Euan Stubbs
- Department of Radiology, McMaster University, Hamilton, ON, Canada
| |
Collapse
|
38
|
Mediboina A, Badam RK, Chodavarapu S. Assessing the Accuracy of Information on Medication Abortion: A Comparative Analysis of ChatGPT and Google Bard AI. Cureus 2024; 16:e51544. [PMID: 38318564 PMCID: PMC10840059 DOI: 10.7759/cureus.51544] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 01/01/2024] [Indexed: 02/07/2024] Open
Abstract
Background and objective ChatGPT and Google Bard AI are widely used conversational chatbots, even in healthcare. While they have several strengths, they can generate seemingly correct but erroneous responses, warranting caution in medical contexts. In an era where access to abortion care is diminishing, patients may increasingly rely on online resources and AI-driven language models for information on medication abortions. In light of this, the present study aimed to compare the accuracy and comprehensiveness of responses generated by ChatGPT 3.5 and Google Bard AI to medical queries about medication abortions. Methods Fourteen open-ended questions about medication abortion were formulated based on the Frequently Asked Questions (FAQs) from the National Abortion Federation (NAF) and the Reproductive Health Access Project (RHAP) websites. These questions were answered using ChatGPT version 3.5 and Google Bard AI on October 7, 2023. The accuracy of the responses was analyzed by cross-referencing the generated answers against the information provided by NAF and RHAP. Any discrepancies were further verified against the guidelines from the American College of Obstetricians and Gynecologists (ACOG). A rating scale used by Johnson et al. was employed for assessment, utilizing a 6-point Likert scale [ranging from 1 (completely incorrect) to 6 (correct)] to evaluate accuracy and a 3-point scale [ranging from 1 (incomplete) to 3 (comprehensive)] to assess completeness. Questions that did not yield answers were assigned a score of 0 and omitted from the correlation analysis. Data analysis and visualization were done using R Software version 4.3.1. Statistical significance was determined by employing Spearman's R and Mann-Whitney U tests. Results All questions were entered sequentially into both chatbots by the same author. On the initial attempt, ChatGPT successfully generated relevant responses for all questions, while Google Bard AI failed to provide answers for five questions. Repeating the same question in Google Bard AI yielded an answer for one; two were answered with different phrasing; and two remained unanswered despite rephrasing. ChatGPT showed a median accuracy score of 5 (mean: 5.26, SD: 0.73) and a median completeness score of 3 (mean: 2.57, SD: 0.51). It showed the highest accuracy score in six responses and the highest completeness score in eight responses. In contrast, Google Bard AI had a median accuracy score of 5 (mean: 4.5, SD: 2.03) and a median completeness score of 2 (mean: 2.14, SD: 1.03). It achieved the highest accuracy score in five responses and the highest completeness score in six responses. Spearman's correlation coefficient revealed no significant correlation between accuracy and completeness for ChatGPT (rs = -0.46771, p = 0.09171). However, Google Bard AI showed a marginally significant correlation (rs = 0.5738, p = 0.05108). The Mann-Whitney U test indicated no statistically significant differences between ChatGPT and Google Bard AI concerning accuracy (U = 82, p>0.05) or completeness (U = 78, p>0.05). Conclusion While both chatbots showed similar levels of accuracy, minor errors were noted, pertaining to finer aspects that demand specialized knowledge of abortion care. This could explain the lack of a significant correlation between accuracy and completeness. Ultimately, AI-driven language models have the potential to provide information on medication abortions, but there is a need for continual refinement and oversight.
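The abstract names Spearman's correlation and the Mann-Whitney U test as the main analyses. As a hedged illustration with invented scores (not the study's data), the sketch below shows how those two tests could be run on per-question accuracy and completeness ratings:

```python
from scipy.stats import spearmanr, mannwhitneyu

# Hypothetical per-question accuracy scores (1-6) for the two chatbots and
# completeness scores (1-3) for ChatGPT; unanswered questions (scored 0) would be
# dropped before analysis, as described in the abstract. Not the study's data.
chatgpt_acc  = [6, 5, 6, 5, 4, 6, 5, 6, 5, 5, 6, 4, 5, 6]
chatgpt_comp = [3, 3, 2, 3, 2, 3, 3, 3, 2, 3, 3, 2, 2, 3]
bard_acc     = [6, 5, 6, 4, 6, 5, 5, 4, 6, 3, 5, 6]  # answered questions only

rs, p_rs = spearmanr(chatgpt_acc, chatgpt_comp)
print(f"Spearman rs = {rs:.3f}, p = {p_rs:.3f}")

u, p_u = mannwhitneyu(chatgpt_acc, bard_acc)
print(f"Mann-Whitney U = {u:.1f}, p = {p_u:.3f}")
```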
Collapse
Affiliation(s)
- Anjali Mediboina
- Community Medicine, Alluri Sita Ramaraju Academy of Medical Sciences, Eluru, IND
| | - Rajani Kumari Badam
- Obstetrics and Gynaecology, Sri Venkateswara Medical College, Tirupathi, IND
| | - Sailaja Chodavarapu
- Obstetrics and Gynaecology, Government Medical College, Rajamahendravaram, IND
| |
Collapse
|
39
|
Park SH. Noteworthy Developments in the Korean Journal of Radiology in 2023 and for 2024. Korean J Radiol 2024; 25:1-5. [PMID: 38184762 PMCID: PMC10788598 DOI: 10.3348/kjr.2023.1172] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/22/2023] [Accepted: 11/22/2023] [Indexed: 01/08/2024] Open
Affiliation(s)
- Seong Ho Park
- Department of Radiology and Research Institute of Radiology, Asan Medical Center, University of Ulsan College of Medicine, Seoul, Republic of Korea.
| |
Collapse
|
40
|
Bhayana R. Chatbots and Large Language Models in Radiology: A Practical Primer for Clinical and Research Applications. Radiology 2024; 310:e232756. [PMID: 38226883 DOI: 10.1148/radiol.232756] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/17/2024]
Abstract
Although chatbots have existed for decades, the emergence of transformer-based large language models (LLMs) has captivated the world through the most recent wave of artificial intelligence chatbots, including ChatGPT. Transformers are a type of neural network architecture that enables better contextual understanding of language and efficient training on massive amounts of unlabeled data, such as unstructured text from the internet. As LLMs have increased in size, their improved performance and emergent abilities have revolutionized natural language processing. Since language is integral to human thought, applications based on LLMs have transformative potential in many industries. In fact, LLM-based chatbots have demonstrated human-level performance on many professional benchmarks, including in radiology. LLMs offer numerous clinical and research applications in radiology, several of which have been explored in the literature with encouraging results. Multimodal LLMs can simultaneously interpret text and images to generate reports, closely mimicking current diagnostic pathways in radiology. Thus, from requisition to report, LLMs have the opportunity to positively impact nearly every step of the radiology journey. Yet, these impressive models are not without limitations. This article reviews the limitations of LLMs and mitigation strategies, as well as potential uses of LLMs, including multimodal models. Also reviewed are existing LLM-based applications that can enhance efficiency in supervised settings.
Collapse
Affiliation(s)
- Rajesh Bhayana
- From University Medical Imaging Toronto, Joint Department of Medical Imaging, University Health Network, Mount Sinai Hospital, and Women's College Hospital, University of Toronto, Toronto General Hospital, 200 Elizabeth St, Peter Munk Bldg, 1st Fl, Toronto, ON, Canada M5G 24C
| |
Collapse
|
41
|
Gan RK, Ogbodo JC, Wee YZ, Gan AZ, González PA. Performance of Google bard and ChatGPT in mass casualty incidents triage. Am J Emerg Med 2024; 75:72-78. [PMID: 37967485 DOI: 10.1016/j.ajem.2023.10.034] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/20/2023] [Revised: 10/03/2023] [Accepted: 10/24/2023] [Indexed: 11/17/2023] Open
Abstract
AIM The objective of our research was to evaluate and compare the performance of ChatGPT, Google Bard, and medical students in performing START triage during mass casualty situations. METHOD We conducted a cross-sectional analysis to compare ChatGPT, Google Bard, and medical students in mass casualty incident (MCI) triage using the Simple Triage And Rapid Treatment (START) method. A validated questionnaire with 15 diverse MCI scenarios was used to assess triage accuracy and content analysis in four categories: "Walking wounded," "Respiration," "Perfusion," and "Mental Status." Statistical analysis compared the results. RESULT Google Bard demonstrated a notably higher accuracy of 60%, while ChatGPT achieved an accuracy of 26.67% (p = 0.002). Comparatively, medical students performed at an accuracy rate of 64.3% in a previous study. However, no significant difference was observed between Google Bard and medical students (p = 0.211). Qualitative content analysis of "walking wounded," "respiration," "perfusion," and "mental status" indicated that Google Bard outperformed ChatGPT. CONCLUSION Google Bard was found to be superior to ChatGPT in correctly performing mass casualty incident triage, achieving an accuracy of 60% versus 26.67% for ChatGPT; this difference was statistically significant (p = 0.002).
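At the scenario level, the reported accuracies correspond to roughly 9 of 15 correct for Google Bard and 4 of 15 for ChatGPT. The sketch below shows how such a 2x2 comparison could be set up with Fisher's exact test; this is only an illustration, and the published p = 0.002 presumably reflects a different or finer-grained analysis than these scenario-level counts:

```python
from scipy.stats import fisher_exact

# Scenario-level counts derived from the abstract's percentages (9/15 vs 4/15
# correct), arranged as a 2x2 table of correct vs. incorrect triage decisions.
table = [[9, 6],    # Google Bard
         [4, 11]]   # ChatGPT

odds_ratio, p = fisher_exact(table)
print(f"odds ratio = {odds_ratio:.2f}, p = {p:.3f}")
```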
Collapse
Affiliation(s)
- Rick Kye Gan
- Unit for Research in Emergency and Disaster, Faculty of Medicine and Health Sciences, University of Oviedo, Oviedo 33006, Spain.
| | - Jude Chukwuebuka Ogbodo
- Unit for Research in Emergency and Disaster, Faculty of Medicine and Health Sciences, University of Oviedo, Oviedo 33006, Spain; Department of Primary Care and Population Health, Medical School, University of Nicosia, Nicosia 2408, Cyprus
| | - Yong Zheng Wee
- Faculty of Computing & Informatics, Multimedia University, 63100 Cyberjaya, Selangor, Malaysia
| | - Ann Zee Gan
- Tenghilan Health Clinic, Tuaran 89208, Sabah, Malaysia; Hospital Universiti Sains Malaysia, 16150 Kota Bharu, Malaysia
| | - Pedro Arcos González
- Unit for Research in Emergency and Disaster, Faculty of Medicine and Health Sciences, University of Oviedo, Oviedo 33006, Spain
| |
Collapse
|
42
|
Indran IR, Paranthaman P, Gupta N, Mustafa N. Twelve tips to leverage AI for efficient and effective medical question generation: A guide for educators using Chat GPT. MEDICAL TEACHER 2023:1-6. [PMID: 38146711 DOI: 10.1080/0142159x.2023.2294703] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 07/04/2023] [Accepted: 12/11/2023] [Indexed: 12/27/2023]
Abstract
BACKGROUND Crafting quality assessment questions in medical education is a crucial yet time-consuming, expertise-driven undertaking that calls for innovative solutions. Large language models (LLMs), such as ChatGPT (Chat Generative Pre-trained Transformer), present a promising yet underexplored avenue for such innovations. AIMS This study explores the utility of ChatGPT to generate diverse, high-quality medical questions, focusing on multiple-choice questions (MCQs) as an illustrative example, to increase educators' productivity and enable self-directed learning for students. DESCRIPTION Leveraging 12 strategies, we demonstrate how ChatGPT can be effectively used to generate assessment questions aligned with Bloom's taxonomy and core knowledge domains while promoting best practices in assessment design. CONCLUSION Integrating LLM tools such as ChatGPT into the generation of medical assessment questions such as MCQs augments, but does not replace, human expertise. With continual instruction refinement, AI can produce high-standard questions. Yet the onus of ensuring ultimate quality and accuracy remains with subject matter experts, affirming the irreplaceable value of human involvement in the artificial intelligence-driven education paradigm.
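To make this kind of workflow concrete, the sketch below shows one way an educator could request a Bloom's-taxonomy-aligned MCQ programmatically. It assumes the OpenAI Python client and an available chat model; the prompt wording and model name are our own illustrative choices, not the article's 12 strategies verbatim:

```python
from openai import OpenAI  # assumes the OpenAI Python client is installed and configured

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative prompt targeting a Bloom's level and a defined knowledge domain.
prompt = (
    "Write one single-best-answer MCQ on beta-blocker pharmacology for "
    "second-year medical students, targeting the 'apply' level of Bloom's "
    "taxonomy. Provide 4 options, the correct answer, and a brief explanation."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # any available chat model could be substituted
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```

As the conclusion above notes, any question produced this way would still require review by a subject matter expert before use in assessment.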
Collapse
Affiliation(s)
- Inthrani Raja Indran
- Department of Pharmacology, National University of Singapore, Yong Loo Lin School of Medicine, Singapore, Singapore
| | - Priya Paranthaman
- Department of Pharmacology, National University of Singapore, Yong Loo Lin School of Medicine, Singapore, Singapore
| | - Neelima Gupta
- Department of Pharmacology, National University of Singapore, Yong Loo Lin School of Medicine, Singapore, Singapore
| | - Nurulhuda Mustafa
- Department of Pharmacology, National University of Singapore, Yong Loo Lin School of Medicine, Singapore, Singapore
| |
Collapse
|
43
|
Huo B, Cacciamani GE, Collins GS, McKechnie T, Lee Y, Guyatt G. Reporting standards for the use of large language model-linked chatbots for health advice. Nat Med 2023; 29:2988. [PMID: 37957381 DOI: 10.1038/s41591-023-02656-2] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/15/2023]
Affiliation(s)
- Bright Huo
- Division of General Surgery, Department of Surgery, McMaster University, Hamilton, Ontario, Canada.
| | - Giovanni E Cacciamani
- USC Institute of Urology and Catherine and Joseph Aresty Department of Urology, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA
- AI Center at USC Urology, University of Southern California, Los Angeles, CA, USA
| | - Gary S Collins
- Centre for Statistics in Medicine, Nuffield Department of Orthopaedics, Rheumatology & Musculoskeletal Sciences, University of Oxford, Oxford, UK
- UK EQUATOR Centre, University of Oxford, Oxford, UK
| | - Tyler McKechnie
- Division of General Surgery, Department of Surgery, McMaster University, Hamilton, Ontario, Canada
- Department of Health Research Methods, Evidence, and Impact, McMaster University, Hamilton, Ontario, Canada
| | - Yung Lee
- Division of General Surgery, Department of Surgery, McMaster University, Hamilton, Ontario, Canada
- Harvard T.H. Chan School of Public Health, Harvard University, Boston, MA, USA
| | - Gordon Guyatt
- Department of Health Research Methods, Evidence, and Impact, McMaster University, Hamilton, Ontario, Canada
- Department of Medicine, McMaster University, Hamilton, Ontario, Canada
| |
Collapse
|
44
|
Alanzi TM, Alzahrani W, Albalawi NS, Allahyani T, Alghamdi A, Al-Zahrani H, Almutairi A, Alzahrani H, Almulhem L, Alanzi N, Al Moarfeg A, Farhah N. Public Awareness of Obesity as a Risk Factor for Cancer in Central Saudi Arabia: Feasibility of ChatGPT as an Educational Intervention. Cureus 2023; 15:e50781. [PMID: 38239542 PMCID: PMC10795720 DOI: 10.7759/cureus.50781] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 12/17/2023] [Indexed: 01/22/2024] Open
Abstract
BACKGROUND While the link between obesity and chronic diseases such as diabetes and cardiovascular disorders is well documented, there is a growing body of evidence connecting obesity with an increased risk of cancer. However, public awareness of this connection remains limited. STUDY PURPOSE To analyze public awareness of overweight/obesity as a risk factor for cancer and to analyze public perceptions of the feasibility of ChatGPT, an artificial intelligence-based conversational agent, as an educational intervention tool. METHODS A mixed-methods approach was used in this study: a deductive, quantitative, cross-sectional approach to draw precise conclusions based on empirical evidence on public awareness of the link between obesity and cancer, and an inductive, qualitative approach to interpret public perceptions of using ChatGPT to create awareness of obesity, cancer, and its risk factors. Participants were adult residents of Saudi Arabia; 486 individuals were included in the survey and 21 in the semi-structured interviews. RESULTS About 65% of the participants were not completely aware of cancer and its risk factors. Significant differences in awareness were observed across age groups (p < .0001), socio-economic status (p = .041), and regional distribution (p = .0351). A total of 10 themes were analyzed from the interview data, comprising five positive factors (accessibility, personalization, cost-effectiveness, anonymity and privacy, and multi-language support) and five negative factors (information inaccuracy, lack of emotional intelligence, dependency and overreliance, data privacy and security, and inability to provide physical support or diagnosis). CONCLUSION This study underscores the potential of leveraging ChatGPT as a valuable public awareness tool for cancer in Saudi Arabia.
Collapse
Affiliation(s)
- Turki M Alanzi
- Department of Health Information Management and Technology, College of Public Health, Imam Abdulrahman Bin Faisal University, Dammam, SAU
| | - Wala Alzahrani
- Department of Clinical Nutrition, College of Applied Medical Sciences, King Abdulaziz University, Jeddah, SAU
| | | | - Taif Allahyani
- College of Applied Medical Sciences, Umm Al-Qura University, Makkah, SAU
| | | | - Haneen Al-Zahrani
- Department of Hematology, Armed Forces Hospital at King Abdulaziz Airbase Dhahran, Dhahran, SAU
| | - Awatif Almutairi
- Department of Clinical Laboratories Sciences, College of Applied Medical Sciences, Jouf University, Jouf, SAU
| | | | | | - Nouf Alanzi
- Department of Clinical Laboratories Sciences, College of Applied Medical Sciences, Jouf University, Jouf, SAU
| | | | - Nesren Farhah
- Department of Health Informatics, College of Health Sciences, Saudi Electronic University, Riyadh, SAU
| |
Collapse
|
45
|
Zhang C, Xu J, Tang R, Yang J, Wang W, Yu X, Shi S. Novel research and future prospects of artificial intelligence in cancer diagnosis and treatment. J Hematol Oncol 2023; 16:114. [PMID: 38012673 PMCID: PMC10680201 DOI: 10.1186/s13045-023-01514-5] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/07/2023] [Accepted: 11/20/2023] [Indexed: 11/29/2023] Open
Abstract
Research into the potential benefits of artificial intelligence for comprehending the intricate biology of cancer has grown as a result of the widespread use of deep learning and machine learning in the healthcare sector and the availability of highly specialized cancer datasets. Here, we review new artificial intelligence approaches and how they are being used in oncology. We describe how artificial intelligence might be used in the detection, prognosis, and administration of cancer treatments and introduce the use of the latest large language models such as ChatGPT in oncology clinics. We highlight artificial intelligence applications for omics data types, and we offer perspectives on how the various data types might be combined to create decision-support tools. We also evaluate the present constraints and challenges to applying artificial intelligence in precision oncology. Finally, we discuss how current challenges may be surmounted to make artificial intelligence useful in clinical settings in the future.
Collapse
Affiliation(s)
- Chaoyi Zhang
- Department of Pancreatic Surgery, Fudan University Shanghai Cancer Center, No. 270 Dong'An Road, Shanghai, 200032, People's Republic of China
- Department of Oncology, Shanghai Medical College, Fudan University, Shanghai, 200032, People's Republic of China
- Shanghai Pancreatic Cancer Institute, No. 399 Lingling Road, Shanghai, 200032, People's Republic of China
- Pancreatic Cancer Institute, Fudan University, Shanghai, 200032, People's Republic of China
| | - Jin Xu
- Department of Pancreatic Surgery, Fudan University Shanghai Cancer Center, No. 270 Dong'An Road, Shanghai, 200032, People's Republic of China
- Department of Oncology, Shanghai Medical College, Fudan University, Shanghai, 200032, People's Republic of China
- Shanghai Pancreatic Cancer Institute, No. 399 Lingling Road, Shanghai, 200032, People's Republic of China
- Pancreatic Cancer Institute, Fudan University, Shanghai, 200032, People's Republic of China
| | - Rong Tang
- Department of Pancreatic Surgery, Fudan University Shanghai Cancer Center, No. 270 Dong'An Road, Shanghai, 200032, People's Republic of China
- Department of Oncology, Shanghai Medical College, Fudan University, Shanghai, 200032, People's Republic of China
- Shanghai Pancreatic Cancer Institute, No. 399 Lingling Road, Shanghai, 200032, People's Republic of China
- Pancreatic Cancer Institute, Fudan University, Shanghai, 200032, People's Republic of China
| | - Jianhui Yang
- Department of Pancreatic Surgery, Fudan University Shanghai Cancer Center, No. 270 Dong'An Road, Shanghai, 200032, People's Republic of China
- Department of Oncology, Shanghai Medical College, Fudan University, Shanghai, 200032, People's Republic of China
- Shanghai Pancreatic Cancer Institute, No. 399 Lingling Road, Shanghai, 200032, People's Republic of China
- Pancreatic Cancer Institute, Fudan University, Shanghai, 200032, People's Republic of China
| | - Wei Wang
- Department of Pancreatic Surgery, Fudan University Shanghai Cancer Center, No. 270 Dong'An Road, Shanghai, 200032, People's Republic of China
- Department of Oncology, Shanghai Medical College, Fudan University, Shanghai, 200032, People's Republic of China
- Shanghai Pancreatic Cancer Institute, No. 399 Lingling Road, Shanghai, 200032, People's Republic of China
- Pancreatic Cancer Institute, Fudan University, Shanghai, 200032, People's Republic of China
| | - Xianjun Yu
- Department of Pancreatic Surgery, Fudan University Shanghai Cancer Center, No. 270 Dong'An Road, Shanghai, 200032, People's Republic of China.
- Department of Oncology, Shanghai Medical College, Fudan University, Shanghai, 200032, People's Republic of China.
- Shanghai Pancreatic Cancer Institute, No. 399 Lingling Road, Shanghai, 200032, People's Republic of China.
- Pancreatic Cancer Institute, Fudan University, Shanghai, 200032, People's Republic of China.
| | - Si Shi
- Department of Pancreatic Surgery, Fudan University Shanghai Cancer Center, No. 270 Dong'An Road, Shanghai, 200032, People's Republic of China.
- Department of Oncology, Shanghai Medical College, Fudan University, Shanghai, 200032, People's Republic of China.
- Shanghai Pancreatic Cancer Institute, No. 399 Lingling Road, Shanghai, 200032, People's Republic of China.
- Pancreatic Cancer Institute, Fudan University, Shanghai, 200032, People's Republic of China.
| |
Collapse
|
46
|
Iannantuono GM, Bracken-Clarke D, Karzai F, Choo-Wosoba H, Gulley JL, Floudas CS. Comparison of Large Language Models in Answering Immuno-Oncology Questions: A Cross-Sectional Study. MEDRXIV : THE PREPRINT SERVER FOR HEALTH SCIENCES 2023:2023.10.31.23297825. [PMID: 38076813 PMCID: PMC10705618 DOI: 10.1101/2023.10.31.23297825] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 12/19/2023]
Abstract
Background The capability of large language models (LLMs) to understand and generate human-readable text has prompted the investigation of their potential as educational and management tools for cancer patients and healthcare providers. Materials and Methods We conducted a cross-sectional study aimed at evaluating the ability of ChatGPT-4, ChatGPT-3.5, and Google Bard to answer questions related to four domains of immuno-oncology (Mechanisms, Indications, Toxicities, and Prognosis). We generated 60 open-ended questions (15 for each section). Questions were manually submitted to the LLMs, and responses were collected on June 30th, 2023. Two reviewers evaluated the answers independently. Results ChatGPT-4 and ChatGPT-3.5 answered all questions, whereas Google Bard answered only 53.3% (p < 0.0001). The proportion of questions with reproducible answers was higher for ChatGPT-4 (95%) and ChatGPT-3.5 (88.3%) than for Google Bard (50%) (p < 0.0001). In terms of accuracy, the proportion of answers deemed fully correct was 75.4%, 58.5%, and 43.8% for ChatGPT-4, ChatGPT-3.5, and Google Bard, respectively (p = 0.03). Furthermore, the proportion of responses deemed highly relevant was 71.9%, 77.4%, and 43.8% for ChatGPT-4, ChatGPT-3.5, and Google Bard, respectively (p = 0.04). Regarding readability, the proportion of responses deemed highly readable was higher for ChatGPT-4 (98.1%) and ChatGPT-3.5 (100%) than for Google Bard (87.5%) (p = 0.02). Conclusion ChatGPT-4 and ChatGPT-3.5 are potentially powerful tools in immuno-oncology, whereas Google Bard demonstrated relatively poorer performance. However, the risk of inaccuracy or incompleteness in the responses was evident in all three LLMs, highlighting the importance of expert-driven verification of the outputs returned by these technologies.
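The abstract compares proportions of fully correct answers across the three models without naming the test used for that comparison here. As a hedged illustration, the sketch below sets up a chi-square test of independence on invented counts chosen to roughly match the reported percentages; it is not the study's analysis:

```python
from scipy.stats import chi2_contingency

# Hypothetical counts of answers rated fully correct vs. not fully correct for
# the three models, roughly in line with the reported percentages; not raw data.
table = [[43, 14],   # ChatGPT-4
         [34, 24],   # ChatGPT-3.5
         [14, 18]]   # Google Bard (answered questions only)

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.4f}")
```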
Collapse
Affiliation(s)
- Giovanni Maria Iannantuono
- Genitourinary Malignancies Branch, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, MD, United States
| | - Dara Bracken-Clarke
- Center for Immuno-Oncology, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, MD, United States
| | - Fatima Karzai
- Genitourinary Malignancies Branch, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, MD, United States
| | - Hyoyoung Choo-Wosoba
- Biostatistics and Data Management Section, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, MD, United States
| | - James L. Gulley
- Center for Immuno-Oncology, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, MD, United States
| | - Charalampos S. Floudas
- Center for Immuno-Oncology, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, MD, United States
| |
Collapse
|
47
|
Hu JM, Liu FC, Chu CM, Chang YT. Health Care Trainees' and Professionals' Perceptions of ChatGPT in Improving Medical Knowledge Training: Rapid Survey Study. J Med Internet Res 2023; 25:e49385. [PMID: 37851495 PMCID: PMC10620632 DOI: 10.2196/49385] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/29/2023] [Revised: 07/13/2023] [Accepted: 09/29/2023] [Indexed: 10/19/2023] Open
Abstract
BACKGROUND ChatGPT is a powerful pretrained large language model. It has both demonstrated potential and raised concerns related to knowledge translation and knowledge transfer. To apply and improve knowledge transfer in the real world, it is essential to assess the perceptions and acceptance of the users of ChatGPT-assisted training. OBJECTIVE We aimed to investigate the perceptions of health care trainees and professionals on ChatGPT-assisted training, using biomedical informatics as an example. METHODS We used purposeful sampling to include all health care undergraduate trainees and graduate professionals (n=195) from January to May 2023 in the School of Public Health at the National Defense Medical Center in Taiwan. Subjects were asked to watch a 2-minute video introducing 5 scenarios about ChatGPT-assisted training in biomedical informatics and then answer a self-designed online (web- and mobile-based) questionnaire according to the Kirkpatrick model. The survey responses were used to develop 4 constructs: "perceived knowledge acquisition," "perceived training motivation," "perceived training satisfaction," and "perceived training effectiveness." The study used structural equation modeling (SEM) to evaluate and test the structural model and hypotheses. RESULTS The online questionnaire response rate was 152 of 195 (78%); 88 of 152 participants (58%) were undergraduate trainees and 90 of 152 participants (59%) were women. The ages ranged from 18 to 53 years (mean 23.3, SD 6.0 years). There was no statistical difference in perceptions of training evaluation between men and women. Most participants were enthusiastic about the ChatGPT-assisted training, while the graduate professionals were more enthusiastic than undergraduate trainees. Nevertheless, some concerns were raised about potential cheating on training assessment. The average scores for knowledge acquisition, training motivation, training satisfaction, and training effectiveness were 3.84 (SD 0.80), 3.76 (SD 0.93), 3.75 (SD 0.87), and 3.72 (SD 0.91), respectively (Likert scale 1-5: strongly disagree to strongly agree). Knowledge acquisition had the highest score and training effectiveness the lowest. In the SEM results, training effectiveness was influenced predominantly by knowledge acquisition and partially met the hypotheses in the research framework. Knowledge acquisition had a direct effect on training effectiveness, training satisfaction, and training motivation, with β coefficients of .80, .87, and .97, respectively (all P<.001). CONCLUSIONS Most health care trainees and professionals perceived ChatGPT-assisted training as an aid in knowledge transfer. However, to improve training effectiveness, it should be combined with empirical experts for proper guidance and dual interaction. In a future study, we recommend using a larger sample size for evaluation of internet-connected large language models in medical knowledge transfer.
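The abstract reports standardized path coefficients from a structural equation model (for example, knowledge acquisition -> training effectiveness). As a greatly simplified, hypothetical illustration, the sketch below fits just that single path as an ordinary least-squares regression on simulated Likert-style construct scores; it does not reproduce the study's SEM or its data:

```python
import numpy as np
import statsmodels.api as sm

# Toy stand-in for the survey constructs: simulate construct scores on a 1-5 scale
# with a strong "knowledge acquisition -> training effectiveness" relationship.
rng = np.random.default_rng(0)
knowledge = rng.uniform(2.5, 5.0, size=150)
effectiveness = 0.8 * knowledge + rng.normal(0.0, 0.3, size=150)

# The study fit a full structural equation model; this OLS fit only illustrates
# how a single path coefficient is estimated in a one-equation simplification.
X = sm.add_constant(knowledge)
fit = sm.OLS(effectiveness, X).fit()
print(fit.params)  # intercept and slope; the slope plays the role of the path coefficient
```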
Collapse
Affiliation(s)
- Je-Ming Hu
- Division of Colorectal Surgery, Department of Surgery, Tri-service General Hospital, National Defense Medical Center, Taipei, Taiwan
- Graduate Institute of Medical Sciences, National Defense Medical Center, Taipei, Taiwan
- School of Medicine, National Defense Medical Center, Taipei, Taiwan
| | - Feng-Cheng Liu
- Division of Rheumatology/Immunology and Allergy, Department of Medicine, Tri-Service General Hospital, National Defense Medical Center, Taipei, Taiwan
| | - Chi-Ming Chu
- Graduate Institute of Medical Sciences, National Defense Medical Center, Taipei, Taiwan
- School of Public Health, National Defense Medical Center, Taipei, Taiwan
- Graduate Institute of Life Sciences, National Defense Medical Center, Taipei, Taiwan
- Big Data Research Center, College of Medicine, Fu-Jen Catholic University, New Taipei City, Taiwan
- Department of Public Health, Kaohsiung Medical University, Kaohsiung, Taiwan
- Department of Public Health, China Medical University, Taichung, Taiwan
| | - Yu-Tien Chang
- School of Public Health, National Defense Medical Center, Taipei, Taiwan
| |
Collapse
|
48
|
Talyshinskii A, Naik N, Hameed BMZ, Zhanbyrbekuly U, Khairli G, Guliev B, Juilebø-Jones P, Tzelves L, Somani BK. Expanding horizons and navigating challenges for enhanced clinical workflows: ChatGPT in urology. Front Surg 2023; 10:1257191. [PMID: 37744723 PMCID: PMC10512827 DOI: 10.3389/fsurg.2023.1257191] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/12/2023] [Accepted: 08/28/2023] [Indexed: 09/26/2023] Open
Abstract
Purpose of review ChatGPT has emerged as a potential tool for facilitating doctors' workflows. However, few studies have examined its application in a urological context. Thus, our objective was to analyze the pros and cons of ChatGPT use and how it can be applied by urologists. Recent findings ChatGPT can facilitate clinical documentation and note-taking, patient communication and support, medical education, and research. In urology, ChatGPT has shown potential as a virtual healthcare aide for benign prostatic hyperplasia, an educational and prevention tool for prostate cancer, educational support for urological residents, and an assistant in writing urological papers and academic work. However, several concerns about its use have been raised, such as the lack of web crawling, the risk of accidental plagiarism, and concerns about patient data privacy. Summary These limitations underscore the need for further improvement of ChatGPT, such as ensuring the privacy of patient data, expanding the training dataset to include medical databases, and developing guidance on its appropriate use. Urologists can also help by conducting studies to determine the effectiveness of ChatGPT in clinical scenarios and nosologies beyond those listed above.
Collapse
Affiliation(s)
- Ali Talyshinskii
- Department of Urology, Astana Medical University, Astana, Kazakhstan
| | - Nithesh Naik
- Department of Mechanical and Industrial Engineering, Manipal Institute of Technology, Manipal Academy of Higher Education, Manipal, India
| | | | | | - Gafur Khairli
- Department of Urology, Astana Medical University, Astana, Kazakhstan
| | - Bakhman Guliev
- Department of Urology, Mariinsky Hospital, St Petersburg, Russia
| | | | - Lazaros Tzelves
- Department of Urology, National and Kapodistrian University of Athens, Sismanogleion Hospital, Athens, Marousi, Greece
| | - Bhaskar Kumar Somani
- Department of Urology, University Hospital Southampton NHS Trust, Southampton, United Kingdom
| |
Collapse
|
49
|
Iannantuono GM, Bracken-Clarke D, Floudas CS, Roselli M, Gulley JL, Karzai F. Applications of large language models in cancer care: current evidence and future perspectives. Front Oncol 2023; 13:1268915. [PMID: 37731643 PMCID: PMC10507617 DOI: 10.3389/fonc.2023.1268915] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/28/2023] [Accepted: 08/21/2023] [Indexed: 09/22/2023] Open
Abstract
The development of large language models (LLMs) is a recent success in the field of generative artificial intelligence (AI). They are computer models able to perform a wide range of natural language processing tasks, including content generation, question answering, or language translation. In recent months, a growing number of studies aimed to assess their potential applications in the field of medicine, including cancer care. In this mini review, we described the present published evidence for using LLMs in oncology. All the available studies assessed ChatGPT, an advanced language model developed by OpenAI, alone or compared to other LLMs, such as Google Bard, Chatsonic, and Perplexity. Although ChatGPT could provide adequate information on the screening or the management of specific solid tumors, it also demonstrated a significant error rate and a tendency toward providing obsolete data. Therefore, an accurate, expert-driven verification process remains mandatory to avoid the potential for misinformation and incorrect evidence. Overall, although this new generative AI-based technology has the potential to revolutionize the field of medicine, including that of cancer care, it will be necessary to develop rules to guide the application of these tools to maximize benefits and minimize risks.
Collapse
Affiliation(s)
- Giovanni Maria Iannantuono
- Genitourinary Malignancies Branch, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, MD, United States
- Medical Oncology Unit, Department of Systems Medicine, University of Rome Tor Vergata, Rome, Italy
| | - Dara Bracken-Clarke
- Center for Immuno-Oncology, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, MD, United States
| | - Charalampos S. Floudas
- Center for Immuno-Oncology, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, MD, United States
| | - Mario Roselli
- Medical Oncology Unit, Department of Systems Medicine, University of Rome Tor Vergata, Rome, Italy
| | - James L. Gulley
- Center for Immuno-Oncology, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, MD, United States
| | - Fatima Karzai
- Genitourinary Malignancies Branch, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, MD, United States
| |
Collapse
|
50
|
Tippareddy C, Jiang S, Bera K, Ramaiya N. Radiology Reading Room for the Future: Harnessing the Power of Large Language Models Like ChatGPT. Curr Probl Diagn Radiol 2023:S0363-0188(23)00133-0. [PMID: 37758604 DOI: 10.1067/j.cpradiol.2023.08.018] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/09/2023] [Revised: 08/28/2023] [Accepted: 08/28/2023] [Indexed: 09/29/2023]
Abstract
Radiology has usually been the field of medicine at the forefront of technological advances, often the first to embrace them wholeheartedly. From digitization to cloud-based architecture, radiology has led the way in adopting the latest advances. With the advent of large language models (LLMs), especially the unprecedented explosion of the freely available ChatGPT, the time is ripe for radiology and radiologists to find novel ways to use the technology to improve their workflow. Toward this end, we believe these LLMs have a key role in the radiology reading room, not only to expedite processes and simplify mundane and archaic tasks but also to increase the knowledge base of radiologists and radiology trainees at a far faster pace. In this article, we discuss some of the ways we believe ChatGPT and similar models can be harnessed in the reading room.
Collapse
Affiliation(s)
- Charit Tippareddy
- Department of Radiology, University Hospitals Cleveland Medical Center, Cleveland, OH
| | - Sirui Jiang
- Department of Radiology, University Hospitals Cleveland Medical Center, Cleveland, OH
| | - Kaustav Bera
- Department of Radiology, University Hospitals Cleveland Medical Center, Cleveland, OH.
| | - Nikhil Ramaiya
- Department of Radiology, University Hospitals Cleveland Medical Center, Cleveland, OH
| |
Collapse
|