1
Suárez A, Jiménez J, Llorente de Pedro M, Andreu-Vázquez C, Díaz-Flores García V, Gómez Sánchez M, Freire Y. Beyond the Scalpel: Assessing ChatGPT's potential as an auxiliary intelligent virtual assistant in oral surgery. Comput Struct Biotechnol J 2024; 24:46-52. PMID: 38162955; PMCID: PMC10755495; DOI: 10.1016/j.csbj.2023.11.058.
Abstract
AI has revolutionized the way we interact with technology. Noteworthy advances in AI algorithms and large language models (LLMs) have led to the development of natural generative language (NGL) systems such as ChatGPT. Although these LLMs can simulate human conversations and generate content in real time, they face challenges related to the topicality and accuracy of the information they generate. This study aimed to assess whether ChatGPT-4 could provide accurate and reliable answers to general dentists in the field of oral surgery, and thus to explore its potential as an intelligent virtual assistant in clinical decision making in oral surgery. Thirty questions related to oral surgery were posed to ChatGPT-4, each question repeated 30 times, yielding a total of 900 responses. Two surgeons graded the answers according to the guidelines of the Spanish Society of Oral Surgery, using a three-point Likert scale (correct, partially correct/incomplete, and incorrect). Disagreements were arbitrated by an experienced oral surgeon, who provided the final grade. Accuracy was found to be 71.7%, and the consistency of the experts' grading across iterations ranged from moderate to almost perfect. ChatGPT-4, with its potential capabilities, will inevitably be integrated into dental disciplines, including oral surgery. In the future, it could be considered an auxiliary intelligent virtual assistant, though it would never replace oral surgery experts. Proper training and expert-verified information will remain vital to the implementation of the technology. More comprehensive research is needed to ensure the safe and successful application of AI in oral surgery.
Affiliation(s)
- Ana Suárez
- Department of Pre-Clinic Dentistry, Faculty of Biomedical and Health Sciences, Universidad Europea de Madrid, Calle Tajo s/n, Villaviciosa de Odón, 28670 Madrid, Spain
- Jaime Jiménez
- Department of Clinic Dentistry, Faculty of Biomedical and Health Sciences, Universidad Europea de Madrid, Calle Tajo s/n, Villaviciosa de Odón, 28670 Madrid, Spain
- María Llorente de Pedro
- Department of Pre-Clinic Dentistry, Faculty of Biomedical and Health Sciences, Universidad Europea de Madrid, Calle Tajo s/n, Villaviciosa de Odón, 28670 Madrid, Spain
- Cristina Andreu-Vázquez
- Department of Veterinary Medicine, Faculty of Biomedical and Health Sciences, Universidad Europea de Madrid, Calle Tajo s/n, Villaviciosa de Odón, 28670 Madrid, Spain
- Víctor Díaz-Flores García
- Department of Pre-Clinic Dentistry, Faculty of Biomedical and Health Sciences, Universidad Europea de Madrid, Calle Tajo s/n, Villaviciosa de Odón, 28670 Madrid, Spain
- Margarita Gómez Sánchez
- Department of Pre-Clinic Dentistry, Faculty of Biomedical and Health Sciences, Universidad Europea de Madrid, Calle Tajo s/n, Villaviciosa de Odón, 28670 Madrid, Spain
- Yolanda Freire
- Department of Pre-Clinic Dentistry, Faculty of Biomedical and Health Sciences, Universidad Europea de Madrid, Calle Tajo s/n, Villaviciosa de Odón, 28670 Madrid, Spain
2
Turan Eİ, Baydemir AE, Özcan FG, Şahin AS. Evaluating the accuracy of ChatGPT-4 in predicting ASA scores: A prospective multicentric study. J Clin Anesth 2024; 96:111475. PMID: 38657530; DOI: 10.1016/j.jclinane.2024.111475.
Abstract
BACKGROUND This study investigates the potential of ChatGPT-4, developed by OpenAI, in enhancing medical decision-making processes, particularly in preoperative assessments using the American Society of Anesthesiologists (ASA) scoring system. The ASA score, a critical tool in evaluating patients' health status and anesthesia risks before surgery, categorizes patients from I to VI based on their overall health and risk factors. Despite its widespread use, determining accurate ASA scores remains a subjective process that may benefit from AI-supported assessments. This research aims to evaluate ChatGPT-4's capability to predict ASA scores accurately compared to expert anesthesiologists' assessments. METHODS In this prospective multicentric study, ethical board approval was obtained, and the study was registered with clinicaltrials.gov (NCT06321445). We included 2851 patients from anesthesiology outpatient clinics, spanning neonates to all age groups and genders, with ASA scores between I-IV. Exclusion criteria were set for ASA V and VI scores, emergency operations, and insufficient information for ASA score determination. Data on patients' demographics, health conditions, and ASA scores by anesthesiologists were collected and anonymized. ChatGPT-4 was then tasked with assigning ASA scores based on the standardized patient data. RESULTS Our results indicate a high level of concordance between ChatGPT-4 predictions and anesthesiologists' evaluations, with Cohen's kappa analysis showing a kappa value of 0.858 (p = 0.000). While the model demonstrated over 90% accuracy in predicting ASA scores I to III, it showed a notable variance in ASA IV scores, suggesting a potential limitation in assessing patients with more complex health conditions. DISCUSSION The findings suggest that ChatGPT-4 can significantly contribute to the medical field by supporting anesthesiologists in preoperative assessments. This study not only demonstrates ChatGPT-4's efficacy in medical data analysis and decision-making but also opens new avenues for AI applications in healthcare, particularly in enhancing patient safety and optimizing surgical outcomes. Further research is needed to refine AI models for complex case assessments and integrate them seamlessly into clinical workflows.
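The concordance analysis above rests on Cohen's kappa between paired categorical ratings. A minimal sketch of that computation, assuming two aligned lists of ASA scores; the data below are invented, not the study's:

```python
# Agreement between anesthesiologists' ASA scores and model-predicted scores,
# measured with Cohen's kappa as in the abstract above.
from sklearn.metrics import cohen_kappa_score

# Invented ASA scores (1-4) for ten patients; the study had 2,851.
asa_by_anesthesiologist = [1, 2, 2, 3, 1, 4, 2, 3, 1, 2]
asa_by_chatgpt          = [1, 2, 2, 3, 1, 3, 2, 3, 1, 2]

kappa = cohen_kappa_score(asa_by_anesthesiologist, asa_by_chatgpt)
print(f"Cohen's kappa: {kappa:.3f}")  # values near 0.86 indicate strong agreement
```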
Affiliation(s)
- Engin İhsan Turan
- Department of Anesthesiology, Istanbul Health Science University Kanuni Sultan Süleyman Education and Training Hospital, Istanbul, Turkey.
- Funda Gümüş Özcan
- Department of Anesthesiology, Basaksehir Cam ve Sakura City Hospital, Istanbul, Turkey
- Ayça Sultan Şahin
- Department of Anesthesiology, Istanbul Health Science University Kanuni Sultan Süleyman Education and Training Hospital, Istanbul, Turkey
3
Karacan E. Evaluating the Quality of Postpartum Hemorrhage Nursing Care Plans Generated by Artificial Intelligence Models. J Nurs Care Qual 2024; 39:206-211. PMID: 38701406; DOI: 10.1097/ncq.0000000000000766.
Abstract
BACKGROUND With the rapidly advancing technological landscape of health care, evaluating the potential use of artificial intelligence (AI) models to prepare nursing care plans is of great importance. PURPOSE The purpose of this study was to evaluate the quality of nursing care plans created by AI for the management of postpartum hemorrhage (PPH). METHODS This cross-sectional exploratory study involved creating a scenario for an imaginary patient with PPH. Information was entered into 3 AI platforms (GPT-4, LaMDA, Med-PaLM) on consecutive days without prior conversation. Care plans were evaluated using the Grading of Recommendations Assessment, Development, and Evaluation (GRADE) scale. RESULTS Med-PaLM exhibited superior quality in developing the care plan compared with LaMDA (Z = 4.354; P = .000) and GPT-4 (Z = 3.126; P = .029). CONCLUSIONS Our findings suggest that despite the strong performance of Med-PaLM, AI, in its current state, is unsuitable for use with real patients.
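The abstract reports pairwise Z statistics without naming the underlying test; a rank-based comparison of ordinal quality ratings, such as a Mann-Whitney U test, is one plausible reading. A sketch under that assumption, with invented ratings:

```python
# Rank-based comparison of per-item care-plan quality ratings between two
# AI platforms. Ratings are invented for illustration only.
from scipy.stats import mannwhitneyu

medpalm_scores = [4, 5, 4, 4, 5, 3, 4, 5]
gpt4_scores    = [3, 4, 3, 2, 4, 3, 3, 4]

u_stat, p_value = mannwhitneyu(medpalm_scores, gpt4_scores, alternative="two-sided")
print(f"U = {u_stat}, p = {p_value:.3f}")
```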
Affiliation(s)
- Emine Karacan
- Dortyol Vocational School of Health Services, Iskenderun Technical University, Hatay, Turkey
4
Levin C, Kagan T, Rosen S, Saban M. An evaluation of the capabilities of language models and nurses in providing neonatal clinical decision support. Int J Nurs Stud 2024; 155:104771. PMID: 38688103; DOI: 10.1016/j.ijnurstu.2024.104771.
Abstract
AIM To assess the clinical reasoning capabilities of two large language models, ChatGPT-4 and Claude-2.0, compared to those of neonatal nurses during neonatal care scenarios. DESIGN A cross-sectional study with a comparative evaluation using a survey instrument that included six neonatal intensive care unit clinical scenarios. PARTICIPANTS 32 neonatal intensive care nurses with 5-10 years of experience working in the neonatal intensive care units of three medical centers. METHODS Participants responded to 6 written clinical scenarios. Simultaneously, we asked ChatGPT-4 and Claude-2.0 to provide initial assessments and treatment recommendations for the same scenarios. The responses from ChatGPT-4 and Claude-2.0 were then scored by certified neonatal nurse practitioners for accuracy, completeness, and response time. RESULTS Both models demonstrated capabilities in clinical reasoning for neonatal care, with Claude-2.0 significantly outperforming ChatGPT-4 in clinical accuracy and speed. However, limitations were identified across the cases in diagnostic precision, treatment specificity, and response lag. CONCLUSIONS While showing promise, current limitations reinforce the need for deep refinement before ChatGPT-4 and Claude-2.0 can be considered for integration into clinical practice. Additional validation of these tools is important to safely leverage this Artificial Intelligence technology for enhancing clinical decision-making. IMPACT The study provides an understanding of the reasoning accuracy of new Artificial Intelligence models in neonatal clinical care. The current accuracy gaps of ChatGPT-4 and Claude-2.0 need to be addressed prior to clinical usage.
Affiliation(s)
- Chedva Levin
- Faculty of School of Life and Health Sciences, Nursing Department, The Jerusalem College of Technology-Lev Academic Center, Jerusalem, Israel; The Department of Vascular Surgery, The Chaim Sheba Medical Center, Tel Hashomer, Ramat Gan, Tel Aviv, Israel
- Shani Rosen
- Department of Nursing, School of Health Professions, Faculty of Medical and Health Sciences, Tel Aviv University, Israel
- Mor Saban
- Department of Nursing, School of Health Professions, Faculty of Medical and Health Sciences, Tel Aviv University, Israel
5
Alkhalaf M, Yu P, Yin M, Deng C. Applying generative AI with retrieval augmented generation to summarize and extract key clinical information from electronic health records. J Biomed Inform 2024:104662. PMID: 38880236; DOI: 10.1016/j.jbi.2024.104662.
Abstract
BACKGROUND Malnutrition is a prevalent issue in residential aged care facilities (RACFs), leading to adverse health outcomes. The ability to efficiently extract key clinical information from the large volume of data in electronic health records (EHRs) can improve understanding of the extent of the problem and support the development of effective interventions. This research aimed to test the efficacy of zero-shot prompt engineering applied to generative artificial intelligence (AI) models, on their own and in combination with retrieval augmented generation (RAG), for automating the tasks of summarizing structured and unstructured EHR data and extracting important malnutrition information. METHODOLOGY We utilized the Llama 2 13B model with zero-shot prompting. The dataset comprises unstructured and structured EHRs related to malnutrition management in 40 Australian RACFs. We applied zero-shot learning to the model alone first, then combined it with RAG, to accomplish two tasks: generating structured summaries of a client's nutritional status and extracting key information about malnutrition risk factors. We utilized 25 notes in the first task and 1,399 in the second. We manually evaluated the model's output on each task against a gold standard dataset. RESULTS The evaluation outcomes indicated that zero-shot learning applied to a generative AI model is highly effective in summarizing and extracting information about the nutritional status of RACF clients. The generated summaries provided a concise and accurate representation of the original data, with an overall accuracy of 93.25%. The addition of RAG improved the summarization process, leading to a 6% increase and an accuracy of 99.25%. The model also proved capable of extracting risk factors, with an accuracy of 90%; however, adding RAG did not further improve accuracy on this task. Overall, the model showed robust performance when information was explicitly stated in the notes, but it could encounter hallucination limitations, particularly when details were not explicitly provided. CONCLUSION This study demonstrates the high performance, and the limitations, of applying zero-shot learning to generative AI models for automatic structured summarization of EHR data and extraction of key clinical information. The inclusion of the RAG approach improved model performance and mitigated the hallucination problem.
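The RAG pattern described above (retrieve relevant note passages, then prepend them to a zero-shot prompt) can be sketched as follows. TF-IDF retrieval stands in for whatever retriever the authors used, and the final generation call to a hosted Llama 2 13B model is left as a placeholder:

```python
# Simplified RAG sketch: retrieve the most relevant note passages, then
# build a zero-shot prompt that grounds the model in that context.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

notes = [
    "Client reports reduced appetite and 4 kg weight loss over 3 months.",
    "Dietitian review: oral nutritional supplement commenced twice daily.",
    "Physiotherapy session completed; mobility unchanged.",
]
question = "Summarize the client's nutritional status and malnutrition risk factors."

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(notes)
scores = cosine_similarity(vectorizer.transform([question]), doc_vectors)[0]
top_passages = [notes[i] for i in scores.argsort()[::-1][:2]]  # two best matches

prompt = (
    "Using only the context below, answer the question.\n"
    "Context:\n" + "\n".join(top_passages) + f"\nQuestion: {question}"
)
# response = generate(prompt)  # placeholder: call to a hosted Llama 2 13B model
print(prompt)
```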
Affiliation(s)
- Mohammad Alkhalaf
- School of Computing and Information Technology, University of Wollongong, Wollongong, NSW 2522, Australia; School of Computer Science, Qassim University, Qassim 51452, Saudi Arabia
- Ping Yu
- School of Computing and Information Technology, University of Wollongong, Wollongong, NSW 2522, Australia
- Mengyang Yin
- Opal Healthcare, Level 11/420 George St, Sydney NSW 2000, Australia
- Chao Deng
- School of Medical, Indigenous and Health Sciences, University of Wollongong, Wollongong, NSW 2522, Australia
6
Puerto Nino AK, Garcia Perez V, Secco S, De Nunzio C, Lombardo R, Tikkinen KAO, Elterman DS. Can ChatGPT provide high-quality patient information on male lower urinary tract symptoms suggestive of benign prostate enlargement? Prostate Cancer Prostatic Dis 2024. PMID: 38871841; DOI: 10.1038/s41391-024-00847-7.
Abstract
BACKGROUND ChatGPT has recently emerged as a novel resource for patients' disease-specific inquiries. There is, however, limited evidence assessing the quality of the information. We evaluated the accuracy and quality of ChatGPT's responses on male lower urinary tract symptoms (LUTS) suggestive of benign prostate enlargement (BPE) when compared to two reference resources. METHODS Using patient information websites from the European Association of Urology and the American Urological Association as reference material, we formulated 88 BPE-centric questions for ChatGPT 4.0+. Independently and in duplicate, we compared ChatGPT's responses against the reference material, calculating accuracy through F1 score, precision, and recall metrics. We used a 5-point Likert scale for quality rating. We evaluated examiner agreement using the interclass correlation coefficient and assessed the difference in quality scores with the Wilcoxon signed-rank test. RESULTS ChatGPT addressed all (88/88) LUTS/BPE-related questions. Across the 88 questions, the recorded F1 score was 0.79 (range: 0-1), precision 0.66 (range: 0-1), recall 0.97 (range: 0-1), and the quality score had a median of 4 (range: 1-5). Examiners had a good level of agreement (ICC = 0.86). We found no statistically significant difference between the scores given by the examiners and the overall quality of the responses (p = 0.72). DISCUSSION ChatGPT demonstrated potential utility in educating patients about BPE/LUTS, its prognosis, and treatment, supporting the decision-making process. One must exercise prudence when recommending it as the sole information outlet. Additional studies are needed to fully understand the extent of AI's efficacy in delivering patient education in urology.
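Since the abstract computes F1, precision, and recall against reference material without specifying the unit of comparison, one hedged reading is fact-level overlap between an answer and the reference. A sketch with invented fact sets:

```python
# Per-question precision, recall, and F1 computed from the overlap between
# facts stated in an answer and facts expected by the reference material.
# Both fact sets below are invented for illustration.
reference_facts = {"alpha-blockers", "watchful waiting", "TURP", "PSA testing"}
chatgpt_facts   = {"alpha-blockers", "watchful waiting", "TURP", "herbal remedies"}

true_positives = len(reference_facts & chatgpt_facts)
precision = true_positives / len(chatgpt_facts)    # correct facts / facts stated
recall    = true_positives / len(reference_facts)  # correct facts / facts expected
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f}, recall={recall:.2f}, F1={f1:.2f}")
```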
Affiliation(s)
- Angie K Puerto Nino
- Faculty of Medicine, University of Helsinki, Helsinki, Finland.
- Division of Urology, Department of Surgery, University of Toronto, Toronto, ON, Canada.
- Silvia Secco
- Department of Urology, Niguarda Hospital, Milan, Italy
- Cosimo De Nunzio
- Urology Unit, Ospedale Sant'Andrea, La Sapienza University of Rome, Rome, Italy
- Riccardo Lombardo
- Urology Unit, Ospedale Sant'Andrea, La Sapienza University of Rome, Rome, Italy
- Kari A O Tikkinen
- Faculty of Medicine, University of Helsinki, Helsinki, Finland
- Department of Urology, University of Helsinki and Helsinki University Hospital, Helsinki, Finland
- Department of Surgery, South Karelian Central Hospital, Lappeenranta, Finland
- Department of Health Research Methods, Evidence and Impact, McMaster University, Hamilton, ON, Canada
- Dean S Elterman
- Division of Urology, Department of Surgery, University of Toronto, Toronto, ON, Canada
7
Wu Y, Wu M, Wang C, Lin J, Liu J, Liu S. Evaluating the Prevalence of Burnout Among Health Care Professionals Related to Electronic Health Record Use: Systematic Review and Meta-Analysis. JMIR Med Inform 2024; 12:e54811. PMID: 38865188; DOI: 10.2196/54811.
Abstract
BACKGROUND Burnout among health care professionals is a significant concern, with detrimental effects on health care service quality and patient outcomes. The use of the electronic health record (EHR) system has been identified as a significant contributor to burnout among health care professionals. OBJECTIVE This systematic review and meta-analysis aims to assess the prevalence of burnout among health care professionals associated with the use of the EHR system, thereby providing evidence to improve health information systems and develop strategies to measure and mitigate burnout. METHODS We conducted a comprehensive search of the PubMed, Embase, and Web of Science databases for English-language peer-reviewed articles published between January 1, 2009, and December 31, 2022. Two independent reviewers applied inclusion and exclusion criteria, and study quality was assessed using the Joanna Briggs Institute checklist and the Newcastle-Ottawa Scale. Meta-analyses were performed using R (version 4.1.3; R Foundation for Statistical Computing), with EndNote X7 (Clarivate) for reference management. RESULTS The review included 32 cross-sectional studies and 5 case-control studies with a total of 66,556 participants, mainly physicians and registered nurses. The pooled prevalence of burnout among health care professionals in cross-sectional studies was 40.4% (95% CI 37.5%-43.2%). Case-control studies indicated a higher likelihood of burnout among health care professionals who spent more time on EHR-related tasks outside work (odds ratio 2.43, 95% CI 2.31-2.57). CONCLUSIONS The findings highlight the association between the increased use of the EHR system and burnout among health care professionals. Potential solutions include optimizing EHR systems, implementing automated dictation or note-taking, employing scribes to reduce documentation burden, and leveraging artificial intelligence to enhance EHR system efficiency and reduce the risk of burnout. TRIAL REGISTRATION PROSPERO International Prospective Register of Systematic Reviews CRD42021281173; https://www.crd.york.ac.uk/prospero/display_record.php?ID=CRD42021281173.
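The pooled prevalence above comes from a meta-analysis of study-level proportions; the review used R, and the exact pooling model is not stated in the abstract. An illustrative fixed-effect, inverse-variance pooling on the logit scale, with invented study counts:

```python
# Inverse-variance pooling of burnout prevalence across studies (fixed-effect,
# logit scale). A simplified stand-in for the review's R-based meta-analysis.
import math

studies = [(120, 300), (45, 150), (260, 700)]  # (burnout cases, sample size)

weights, logits = [], []
for cases, n in studies:
    p = cases / n
    logits.append(math.log(p / (1 - p)))
    var = 1 / (n * p * (1 - p))  # variance of the logit-transformed proportion
    weights.append(1 / var)

pooled_logit = sum(w * l for w, l in zip(weights, logits)) / sum(weights)
pooled_prev = 1 / (1 + math.exp(-pooled_logit))
print(f"Pooled prevalence: {pooled_prev:.1%}")
```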
Affiliation(s)
- Yuxuan Wu
- Department of Medical Informatics, West China Hospital, Sichuan University, Chengdu, China
- Mingyue Wu
- Information Center, West China Hospital, Sichuan University, Chengdu, China
- Changyu Wang
- West China College of Stomatology, Sichuan University, Chengdu, China
- Jie Lin
- Department of Oral Implantology, West China Hospital of Stomatology, Sichuan University, Chengdu, China
- Jialin Liu
- Department of Medical Informatics, West China Hospital, Sichuan University, Chengdu, China
- Information Center, West China Hospital, Sichuan University, Chengdu, China
- Siru Liu
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, United States
8
Elwyn G, Ryan P, Blumkin D, Weeks WB. Meet generative AI… your new shared decision-making assistant. BMJ Evid Based Med 2024. PMID: 38866469; DOI: 10.1136/bmjebm-2023-112651.
Affiliation(s)
- Glyn Elwyn
- The Dartmouth Institute for Health Policy and Clinical Practice, Dartmouth College, Hanover, New Hampshire, USA
9
Moura L, Jones DT, Sheikh IS, Murphy S, Kalfin M, Kummer BR, Weathers AL, Grinspan ZM, Silsbee HM, Jones LK, Patel AD. Implications of Large Language Models for Quality and Efficiency of Neurologic Care: Emerging Issues in Neurology. Neurology 2024; 102:e209497. PMID: 38759131; DOI: 10.1212/wnl.0000000000209497.
Abstract
Large language models (LLMs) are advanced artificial intelligence (AI) systems that excel in recognizing and generating human-like language, possibly serving as valuable tools for neurology-related information tasks. Although LLMs have shown remarkable potential in various areas, their performance in the dynamic environment of daily clinical practice remains uncertain. This article outlines multiple limitations and challenges of using LLMs in clinical settings that need to be addressed, including limited clinical reasoning, variable reliability and accuracy, reproducibility bias, self-serving bias, sponsorship bias, and potential for exacerbating health care disparities. These challenges are further compounded by practical business considerations and infrastructure requirements, including associated costs. To overcome these hurdles and harness the potential of LLMs effectively, this article includes considerations for health care organizations, researchers, and neurologists contemplating the use of LLMs in clinical practice. It is essential for health care organizations to cultivate a culture that welcomes AI solutions and aligns them seamlessly with health care operations. Clear objectives and business plans should guide the selection of AI solutions, ensuring they meet organizational needs and budget considerations. Engaging both clinical and nonclinical stakeholders can help secure necessary resources, foster trust, and ensure the long-term sustainability of AI implementations. Testing, validation, training, and ongoing monitoring are pivotal for successful integration. For neurologists, safeguarding patient data privacy is paramount. Seeking guidance from institutional information technology resources for informed, compliant decisions, and remaining vigilant against biases in LLM outputs are essential practices in responsible and unbiased utilization of AI tools. In research, obtaining institutional review board approval is crucial when dealing with patient data, even if deidentified, to ensure ethical use. Compliance with established guidelines like SPIRIT-AI, MI-CLAIM, and CONSORT-AI is necessary to maintain consistency and mitigate biases in AI research. In summary, the integration of LLMs into clinical neurology offers immense promise while presenting formidable challenges. Awareness of these considerations is vital for harnessing the potential of AI in neurologic care effectively and enhancing patient care quality and safety. The article serves as a guide for health care organizations, researchers, and neurologists navigating this transformative landscape.
Affiliation(s)
- Lidia Moura, David T Jones, Irfan S Sheikh, Shawn Murphy, Michael Kalfin, Benjamin R Kummer, Allison L Weathers, Zachary M Grinspan, Heather M Silsbee, Lyell K Jones, Anup D Patel
- From the Center for Value-based Health Care and Sciences (L.M.), and Department of Neurology (L.M., S.M.), Massachusetts General Hospital, Boston; Harvard Medical School (L.M., S.M.), Boston, MA; Department of Neurology (D.T.J., L.K.J.), Mayo Clinic, Rochester, MN; Department of Neurology (I.S.S.), University of Texas Southwestern Medical Center, Dallas; Department of Neurology (M.K.), University of Pennsylvania Health System, Philadelphia; Department of Neurology (B.R.K.), Icahn School of Medicine at Mount Sinai, New York, NY; Information Technology Division (A.L.W.), Cleveland Clinic, OH; Department of Pediatrics (Z.M.G.), Weill Cornell Medicine, New York, NY; American Academy of Neurology (H.M.S.), Minneapolis, MN; and The Center for Clinical Excellence (A.D.P.), Nationwide Children's Hospital, Division of Neurology, The Ohio State University College of Medicine, Columbus
10
Roldan-Vasquez E, Mitri S, Bhasin S, Bharani T, Capasso K, Haslinger M, Sharma R, James TA. Reliability of artificial intelligence chatbot responses to frequently asked questions in breast surgical oncology. J Surg Oncol 2024. PMID: 38837375; DOI: 10.1002/jso.27715.
Abstract
INTRODUCTION Artificial intelligence (AI)-driven chatbots, capable of simulating human-like conversations, are becoming more prevalent in healthcare. While this technology offers potential benefits in patient engagement and information accessibility, it raises concerns about potential misuse, misinformation, inaccuracies, and ethical challenges. METHODS This study evaluated a publicly available AI chatbot, ChatGPT, on its responses to nine questions related to breast cancer surgery selected from the American Society of Breast Surgeons' frequently asked questions (FAQ) patient education website. Four breast surgical oncologists assessed the responses for accuracy and reliability using a five-point Likert scale and the Patient Education Materials Assessment Tool (PEMAT). RESULTS The average reliability score for ChatGPT in answering breast cancer surgery questions was 3.98 out of 5.00. Surgeons unanimously found the responses understandable and actionable per the PEMAT criteria. The consensus found ChatGPT's overall performance was appropriate, with minor or no inaccuracies. CONCLUSION ChatGPT demonstrates good reliability in responding to breast cancer surgery queries, with minor, nonharmful inaccuracies. Its answers are accurate, clear, and easy to comprehend. Notably, ChatGPT acknowledged its informational role and did not attempt to replace medical advice or discourage users from seeking input from a healthcare professional.
Affiliation(s)
- Estefania Roldan-Vasquez
- Department of Surgery, Breast Surgical Oncology, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, Massachusetts, USA
- Samir Mitri
- Department of Surgery, Breast Surgical Oncology, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, Massachusetts, USA
- Shreya Bhasin
- Department of Surgery, Breast Surgical Oncology, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, Massachusetts, USA
- School of Medicine and Dentistry, University of Rochester, Rochester, New York, USA
- Tina Bharani
- Department of Surgery, Breast Surgical Oncology, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, Massachusetts, USA
- Department of Surgery, Brigham and Women's Hospital, Harvard Medical School, Boston, Massachusetts, USA
- Kathryn Capasso
- Department of Surgery, Breast Surgical Oncology, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, Massachusetts, USA
- Michelle Haslinger
- Department of Surgery, Breast Surgical Oncology, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, Massachusetts, USA
- Ranjna Sharma
- Department of Surgery, Breast Surgical Oncology, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, Massachusetts, USA
- Ted A James
- Department of Surgery, Breast Surgical Oncology, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, Massachusetts, USA
11
Harada Y, Suzuki T, Harada T, Sakamoto T, Ishizuka K, Miyagami T, Kawamura R, Kunitomo K, Nagano H, Shimizu T, Watari T. Performance evaluation of ChatGPT in detecting diagnostic errors and their contributing factors: an analysis of 545 case reports of diagnostic errors. BMJ Open Qual 2024; 13:e002654. PMID: 38830730; PMCID: PMC11149143; DOI: 10.1136/bmjoq-2023-002654.
Abstract
BACKGROUND Manual chart review using validated assessment tools is a standardised methodology for detecting diagnostic errors. However, this requires considerable human resources and time. ChatGPT, a recently developed artificial intelligence chatbot based on a large language model, can effectively classify text based on suitable prompts. Therefore, ChatGPT can assist manual chart reviews in detecting diagnostic errors. OBJECTIVE This study aimed to clarify whether ChatGPT could correctly detect diagnostic errors and possible factors contributing to them based on case presentations. METHODS We analysed 545 published case reports that included diagnostic errors. We input the texts of the case presentations and the final diagnoses, with some original prompts, into ChatGPT (GPT-4) to generate responses, including the judgement of diagnostic errors and the factors contributing to them. Factors contributing to diagnostic errors were coded according to the following three taxonomies: Diagnosis Error Evaluation and Research (DEER), Reliable Diagnosis Challenges (RDC) and Generic Diagnostic Pitfalls (GDP). The responses on the contributing factors from ChatGPT were compared with those from physicians. RESULTS ChatGPT correctly detected diagnostic errors in 519/545 cases (95%) and coded statistically larger numbers of factors contributing to diagnostic errors per case than physicians: DEER (median 5 vs 1, p<0.001), RDC (median 4 vs 2, p<0.001) and GDP (median 4 vs 1, p<0.001). The most important contributing factors of diagnostic errors coded by ChatGPT were 'failure/delay in considering the diagnosis' (315, 57.8%) in DEER, 'atypical presentation' (365, 67.0%) in RDC, and 'atypical presentation' (264, 48.4%) in GDP. CONCLUSION ChatGPT accurately detects diagnostic errors from case presentations. ChatGPT may be more sensitive than manual reviewing in detecting factors contributing to diagnostic errors, especially 'atypical presentation'.
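The workflow described above (feeding a case presentation, a final diagnosis, and a prompt into GPT-4) might look like the following sketch using the OpenAI Python client (v1.x). The prompt wording and case text are illustrative, not the authors' original materials:

```python
# Prompting GPT-4 to judge whether a diagnostic error occurred and to code
# contributing factors against a named taxonomy.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

case_presentation = "A 58-year-old man presented with acute back pain..."  # shortened example
final_diagnosis = "Aortic dissection"

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{
        "role": "user",
        "content": (
            "Given the case presentation and final diagnosis below, state "
            "whether a diagnostic error occurred, then list contributing "
            "factors using the DEER taxonomy.\n\n"
            f"Case: {case_presentation}\nFinal diagnosis: {final_diagnosis}"
        ),
    }],
)
print(response.choices[0].message.content)
```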
Affiliation(s)
- Yukinori Harada
- Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Shimotsuga-gun, Tochigi, Japan
- Taku Harada
- Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Shimotsuga-gun, Tochigi, Japan
- Nerima Hikarigaoka Hospital, Nerima-ku, Tokyo, Japan
- Tetsu Sakamoto
- Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Shimotsuga-gun, Tochigi, Japan
- Kosuke Ishizuka
- Yokohama City University School of Medicine Graduate School of Medicine, Yokohama, Kanagawa, Japan
- Taiju Miyagami
- Department of General Medicine, Faculty of Medicine, Juntendo University, Bunkyo-ku, Tokyo, Japan
- Ren Kawamura
- Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Shimotsuga-gun, Tochigi, Japan
- Hiroyuki Nagano
- Department of General Internal Medicine, Tenri Hospital, Tenri, Nara, Japan
- Taro Shimizu
- Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Shimotsuga-gun, Tochigi, Japan
- Takashi Watari
- Integrated Clinical Education Center, Kyoto University Hospital, Kyoto, Kyoto, Japan
12
Balasanjeevi G, Surapaneni KM. Comparison of ChatGPT version 3.5 & 4 for utility in respiratory medicine education using clinical case scenarios. Respir Med Res 2024; 85:101091. PMID: 38657295; DOI: 10.1016/j.resmer.2024.101091.
Abstract
Integration of ChatGPT in respiratory medicine presents a promising avenue for enhancing clinical practice and pedagogical approaches. This study compares the performance of ChatGPT versions 3.5 and 4 in respiratory medicine, emphasizing their potential in clinical decision support and medical education using clinical cases. Results indicate moderate performance, highlighting limitations in handling complex case scenarios. Compared to ChatGPT 3.5, version 4 showed greater promise as a pedagogical tool, providing interactive learning experiences. While it can serve as a preliminary clinical decision support tool, caution is advised, stressing the need for ongoing validation. Future research should refine its clinical capabilities for optimal integration into medical education and practice.
Affiliation(s)
- Gayathri Balasanjeevi
- Department of Tuberculosis & Respiratory Diseases, Panimalar Medical College Hospital & Research Institute, Varadharajapuram, Poonamallee, Chennai 600 123, Tamil Nadu, India
- Krishna Mohan Surapaneni
- Department of Biochemistry, Panimalar Medical College Hospital & Research Institute, Varadharajapuram, Poonamallee, Chennai, 600 123 Tamil Nadu, India; Department of Medical Education, Panimalar Medical College Hospital & Research Institute, Varadharajapuram, Poonamallee, Chennai, 600 123 Tamil Nadu, India
13
Shen SA, Perez-Heydrich CA, Xie DX, Nellis JC. ChatGPT vs. web search for patient questions: what does ChatGPT do better? Eur Arch Otorhinolaryngol 2024; 281:3219-3225. PMID: 38416195; DOI: 10.1007/s00405-024-08524-0.
Abstract
PURPOSE Chat generative pretrained transformer (ChatGPT) has the potential to significantly impact how patients acquire medical information online. Here, we characterize the readability and appropriateness of ChatGPT responses to a range of patient questions compared to results from traditional web searches. METHODS Patient questions related to the published Clinical Practice Guidelines by the American Academy of Otolaryngology-Head and Neck Surgery were sourced from existing online posts. Questions were categorized using a modified Rothwell classification system into (1) fact, (2) policy, and (3) diagnosis and recommendations. These were queried using ChatGPT and traditional web search. All results were evaluated on readability (Flesch Reading Ease and Flesch-Kincaid Grade Level) and understandability (Patient Education Materials Assessment Tool). Accuracy was assessed by two blinded clinical evaluators using a three-point ordinal scale. RESULTS 54 questions were organized into fact (37.0%), policy (37.0%), and diagnosis (25.8%). The average readability for ChatGPT responses was lower than traditional web search (FRE: 42.3 ± 13.1 vs. 55.6 ± 10.5, p < 0.001), while PEMAT understandability was equivalent (93.8% vs. 93.5%, p = 0.17). ChatGPT scored higher than web search for questions in the 'Diagnosis' category (p < 0.01); there was no difference for questions categorized as 'Fact' (p = 0.15) or 'Policy' (p = 0.22). Additional prompting improved ChatGPT response readability (FRE 55.6 ± 13.6, p < 0.01). CONCLUSIONS ChatGPT outperforms web search in answering patient questions related to symptom-based diagnoses and is equivalent in providing medical facts and established policy. Appropriate prompting can further improve readability while maintaining accuracy. Further patient education is needed to relay the benefits and limitations of this technology as a source of medical information.
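Both readability formulas named above are published constants over words-per-sentence and syllables-per-word. A sketch of each, using a crude vowel-group heuristic for syllables (validated tools count syllables more carefully, so treat the output as an estimate):

```python
# Flesch Reading Ease (FRE) and Flesch-Kincaid Grade Level (FKGL) from
# approximate sentence, word, and syllable counts.
import re

def count_syllables(word: str) -> int:
    # Vowel-group heuristic: each run of vowels counts as one syllable.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability(text: str) -> tuple[float, float]:
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    wps, spw = len(words) / sentences, syllables / len(words)
    fre = 206.835 - 1.015 * wps - 84.6 * spw   # Flesch Reading Ease
    fkgl = 0.39 * wps + 11.8 * spw - 15.59     # Flesch-Kincaid Grade Level
    return fre, fkgl

print(readability("Tonsillitis is an infection of the tonsils. It often causes a sore throat."))
```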
Affiliation(s)
- Sarek A Shen
- Department of Otolaryngology-Head and Neck Surgery, Johns Hopkins School of Medicine, 601 North Caroline Street, Baltimore, MD, 21287, USA.
- Deborah X Xie
- Department of Otolaryngology-Head and Neck Surgery, Johns Hopkins School of Medicine, 601 North Caroline Street, Baltimore, MD, 21287, USA
- Jason C Nellis
- Department of Otolaryngology-Head and Neck Surgery, Johns Hopkins School of Medicine, 601 North Caroline Street, Baltimore, MD, 21287, USA
14
Koga S, Du W. Integrating AI in medicine: Lessons from Chat-GPT's limitations in medical imaging. Dig Liver Dis 2024; 56:1114-1115. PMID: 38429138; DOI: 10.1016/j.dld.2024.02.014.
Affiliation(s)
- Shunsuke Koga
- Department of Pathology and Laboratory Medicine, Hospital of the University of Pennsylvania, 3400 Spruce Street, Philadelphia, PA 19104, United States.
- Wei Du
- Department of Pathology and Laboratory Medicine, Hospital of the University of Pennsylvania, 3400 Spruce Street, Philadelphia, PA 19104, United States
15
Baldwin AJ. An artificial intelligence language model improves readability of burns first aid information. Burns 2024; 50:1122-1127. PMID: 38492982; DOI: 10.1016/j.burns.2024.03.005.
Abstract
AIMS This study aimed to assess the potential of using an artificial intelligence (AI) large language model to improve the readability of burns first aid information. METHODS An AI language model (ChatGPT-3) was used to rewrite content from the top 50 English-language webpages containing burns first aid information so that it would be understandable by an individual with the literacy level of an 11-year-old, as recommended by the American Medical Association and Health Education England. Readability was assessed using five validated tools. RESULTS In their original form, only 4% of the patient education materials (PEMs) met the target readability level across all tools. The median grade was 6.9 (SD=1.1); a one-sample one-tailed t-test revealed that this was not significantly below the target (p = .31). After AI modification, 18% of PEMs reached the target level using all tools, with a median grade of 6 (SD=0.9), which was significantly below the target level (p < .001). A paired t-test demonstrated that all readability scores improved significantly once rewritten using AI (p < .001). CONCLUSION Utilising an AI language model proved an effective and viable method for enhancing the readability of burns first aid information.
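The before/after comparison above is a paired test on readability grades for the same pages pre- and post-rewriting. A sketch with invented grade levels:

```python
# Paired t-test on readability grade levels for the same webpages before
# and after AI rewriting. Values are invented for illustration.
from scipy.stats import ttest_rel

original_grades  = [7.8, 6.9, 8.2, 7.1, 6.5, 9.0]
rewritten_grades = [6.2, 5.8, 6.9, 6.0, 5.5, 7.1]

t_stat, p_value = ttest_rel(original_grades, rewritten_grades)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")  # a significant drop means improved readability
```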
Affiliation(s)
- Alexander J Baldwin
- Department of Burns and Plastic Surgery, Buckinghamshire Healthcare NHS Trust, Buckinghamshire, UK.
16
Lee VV, van der Lubbe SCC, Goh LH, Valderas JM. Harnessing ChatGPT for Thematic Analysis: Are We Ready? J Med Internet Res 2024; 26:e54974. PMID: 38819896; PMCID: PMC11179012; DOI: 10.2196/54974.
Abstract
ChatGPT (OpenAI) is an advanced natural language processing tool with growing applications across various disciplines in medical research. Thematic analysis, a qualitative research method to identify and interpret patterns in data, is one application that stands to benefit from this technology. This viewpoint explores the use of ChatGPT in three core phases of thematic analysis within a medical context: (1) direct coding of transcripts, (2) generating themes from a predefined list of codes, and (3) preprocessing quotes for manuscript inclusion. Additionally, we explore the potential of ChatGPT to generate interview transcripts, which may be used for training purposes. We assess the strengths and limitations of using ChatGPT in these roles, highlighting areas where human intervention remains necessary. Overall, we argue that ChatGPT can function as a valuable tool during analysis, enhancing the efficiency of the thematic analysis and offering additional insights into the qualitative data. While ChatGPT may not adequately capture the full context of each participant, it can serve as an additional member of the analysis team, contributing to researcher triangulation through knowledge building and sensemaking.
Affiliation(s)
- V Vien Lee
- Division of Family Medicine, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore
- Stephanie C C van der Lubbe
- Division of Family Medicine, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore
- Lay Hoon Goh
- Division of Family Medicine, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore
- Department of Family Medicine, National University Health System, Singapore, Singapore
- Jose Maria Valderas
- Division of Family Medicine, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore
- Department of Family Medicine, National University Health System, Singapore, Singapore
- Centre for Research in Health Systems Performance, National University of Singapore, Singapore, Singapore
17
Buhr CR, Smith H, Huppertz T, Bahr-Hamm K, Matthias C, Cuny C, Snijders JP, Ernst BP, Blaikie A, Kelsey T, Kuhn S, Eckrich J. Assessing unknown potential-quality and limitations of different large language models in the field of otorhinolaryngology. Acta Otolaryngol 2024:1-6. DOI: 10.1080/00016489.2024.2352843.
Abstract
BACKGROUND Large Language Models (LLMs) might offer a solution for the lack of trained health personnel, particularly in low- and middle-income countries. However, their strengths and weaknesses remain unclear. AIMS/OBJECTIVES Here we benchmark different LLMs (Bard 2023.07.13, Claude 2, ChatGPT 4) against six consultants in otorhinolaryngology (ORL). MATERIAL AND METHODS Case-based questions were extracted from the literature and German state examinations. Answers from Bard 2023.07.13, Claude 2, ChatGPT 4, and six ORL consultants were rated blindly on a 6-point Likert scale for medical adequacy, comprehensibility, coherence, and conciseness. The given answers were compared to validated answers and evaluated for hazards. A modified Turing test was performed and character counts were compared. RESULTS The LLMs' answers ranked inferior to the consultants' in all categories. Yet, the difference between consultants and LLMs was marginal, with the clearest disparity in conciseness and the smallest in comprehensibility. Among the LLMs, Claude 2 was rated best in medical adequacy and conciseness. Consultants' answers matched the validated solution in 93% (228/246), ChatGPT 4 in 85% (35/41), Claude 2 in 78% (32/41), and Bard 2023.07.13 in 59% (24/41). Answers were rated as potentially hazardous in 10% (24/246) for ChatGPT 4, 14% (34/246) for Claude 2, 19% (46/264) for Bard 2023.07.13, and 6% (71/1230) for consultants. CONCLUSIONS AND SIGNIFICANCE Despite the consultants' superior performance, LLMs show potential for clinical application in ORL. Future studies should assess their performance on a larger scale.
Affiliation(s)
- Christoph R Buhr
- Department of Otorhinolaryngology, University Medical Center of the Johannes Gutenberg-University Mainz, Mainz, Germany
- School of Medicine, University of St Andrews, St Andrews, UK
- Harry Smith
- School of Computer Science, University of St Andrews, St Andrews, UK
- Tilman Huppertz
- Department of Otorhinolaryngology, University Medical Center of the Johannes Gutenberg-University Mainz, Mainz, Germany
- Katharina Bahr-Hamm
- Department of Otorhinolaryngology, University Medical Center of the Johannes Gutenberg-University Mainz, Mainz, Germany
- Christoph Matthias
- Department of Otorhinolaryngology, University Medical Center of the Johannes Gutenberg-University Mainz, Mainz, Germany
- Clemens Cuny
- Outpatient Clinic, Clemens Cuny, Dieburg, Germany
- Andrew Blaikie
- School of Medicine, University of St Andrews, St Andrews, UK
- Tom Kelsey
- School of Computer Science, University of St Andrews, St Andrews, UK
- Sebastian Kuhn
- Institute for Digital Medicine, Philipps-University Marburg, University Hospital of Giessen and Marburg, Marburg, Germany
- Jonas Eckrich
- Department of Otorhinolaryngology, University Medical Center of the Johannes Gutenberg-University Mainz, Mainz, Germany
18
Liu S, McCoy AB, Wright AP, Nelson SD, Huang SS, Ahmad HB, Carro SE, Franklin J, Brogan J, Wright A. Why do users override alerts? Utilizing large language model to summarize comments and optimize clinical decision support. J Am Med Inform Assoc 2024; 31:1388-1396. PMID: 38452289; PMCID: PMC11105133; DOI: 10.1093/jamia/ocae041.
Abstract
OBJECTIVES To evaluate the capability of generative artificial intelligence (AI) in summarizing alert comments and to determine whether the AI-generated summaries could be used to improve clinical decision support (CDS) alerts. MATERIALS AND METHODS We extracted user comments on alerts generated from September 1, 2022 to September 1, 2023 at Vanderbilt University Medical Center. For a subset of 8 alerts, comment summaries were generated independently by 2 physicians and then separately by GPT-4. We surveyed 5 CDS experts to rate the human-generated and AI-generated summaries on a scale from 1 (strongly disagree) to 5 (strongly agree) across 4 metrics: clarity, completeness, accuracy, and usefulness. RESULTS Five CDS experts participated in the survey. A total of 16 human-generated summaries and 8 AI-generated summaries were assessed. Five of the top 8 rated summaries were generated by GPT-4. AI-generated summaries demonstrated high levels of clarity, accuracy, and usefulness, similar to the human-generated summaries. Moreover, AI-generated summaries exhibited significantly higher completeness and usefulness than the human-generated summaries (AI: 3.4 ± 1.2, human: 2.7 ± 1.2, P = .001). CONCLUSION End-user comments provide clinicians' immediate feedback on CDS alerts and can serve as a direct and valuable data resource for improving CDS delivery. Traditionally, these comments may not be considered in the CDS review process due to their unstructured nature, large volume, and the presence of redundant or irrelevant content. Our study demonstrates that GPT-4 is capable of distilling these comments into summaries characterized by high clarity, accuracy, and completeness. AI-generated summaries are equivalent to, and potentially better than, human-generated summaries. These AI-generated summaries could provide CDS experts with a novel means of reviewing user comments to rapidly optimize CDS alerts both online and offline.
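The study does not publish its prompt or parameters; as a hedged sketch of what the summarization step might look like with the OpenAI Python client (the prompt wording, model string, and function name are assumptions, not the authors' code):

```python
# Hypothetical sketch of distilling override comments for one CDS alert.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def summarize_comments(alert_name: str, comments: list[str]) -> str:
    joined = "\n".join(f"- {c}" for c in comments)
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "Summarize clinicians' comments on a clinical decision "
                        "support alert: main override reasons, recurring "
                        "complaints, and suggested fixes."},
            {"role": "user", "content": f"Alert: {alert_name}\nComments:\n{joined}"},
        ],
    )
    return response.choices[0].message.content
```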
Collapse
Affiliation(s)
- Siru Liu
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37212, United States
- Department of Computer Science, Vanderbilt University, Nashville, TN 37212, United States
| | - Allison B McCoy
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37212, United States
| | - Aileen P Wright
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37212, United States
- Department of Medicine, Vanderbilt University Medical Center, Nashville, TN 37212, United States
| | - Scott D Nelson
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37212, United States
| | - Sean S Huang
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37212, United States
- Department of Medicine, Vanderbilt University Medical Center, Nashville, TN 37212, United States
| | - Hasan B Ahmad
- Department of Biomedical Informatics and Medical Education, University of Washington, Seattle, WA 98195, United States
| | - Sabrina E Carro
- Department of Pediatrics, Vanderbilt University Medical Center, Nashville, TN 37212, United States
| | - Jacob Franklin
- Department of Medicine, Vanderbilt University Medical Center, Nashville, TN 37212, United States
| | - James Brogan
- Department of Medicine, Vanderbilt University Medical Center, Nashville, TN 37212, United States
| | - Adam Wright
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37212, United States
- Department of Medicine, Vanderbilt University Medical Center, Nashville, TN 37212, United States
| |
Collapse
|
19
|
Liu S, McCoy AB, Wright AP, Carew B, Genkins JZ, Huang SS, Peterson JF, Steitz B, Wright A. Leveraging large language models for generating responses to patient messages-a subjective analysis. J Am Med Inform Assoc 2024; 31:1367-1379. [PMID: 38497958 PMCID: PMC11105129 DOI: 10.1093/jamia/ocae052] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/17/2023] [Revised: 01/17/2024] [Accepted: 02/28/2024] [Indexed: 03/19/2024] Open
Abstract
OBJECTIVE This study aimed to develop and assess the performance of fine-tuned large language models for generating responses to patient messages sent via an electronic health record patient portal. MATERIALS AND METHODS Utilizing a dataset of messages and responses extracted from the patient portal at a large academic medical center, we developed a model (CLAIR-Short) based on a pre-trained large language model (LLaMA-65B). In addition, we used the OpenAI API to update physician responses from an open-source dataset into a format with informative paragraphs that offered patient education while emphasizing empathy and professionalism. By combining this dataset with the portal data, we further fine-tuned our model (CLAIR-Long). To evaluate the fine-tuned models, we used 10 representative patient portal questions in primary care to generate responses. We asked primary care physicians to review the generated responses from our models and ChatGPT and to rate them for empathy, responsiveness, accuracy, and usefulness. RESULTS The dataset consisted of 499 794 pairs of patient messages and corresponding responses from the patient portal, plus 5000 patient messages and ChatGPT-updated responses from an online platform. Four primary care physicians participated in the survey. CLAIR-Short exhibited the ability to generate concise responses similar to providers' responses. CLAIR-Long responses provided more patient educational content than CLAIR-Short and were rated similarly to ChatGPT's responses, receiving positive evaluations for responsiveness, empathy, and accuracy, and a neutral rating for usefulness. CONCLUSION This subjective analysis suggests that leveraging large language models to generate responses to patient messages has significant potential to facilitate communication between patients and healthcare providers.
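The paper's exact data schema is not given; a minimal sketch of one common way to serialize message/response pairs for instruction-style fine-tuning (field names and example text are invented):

```python
# Write portal message/response pairs as instruction-tuning JSONL records.
import json

pairs = [
    {"message": "My incision is red around the edges. Is that normal?",
     "response": "Some redness near the incision is common in the first week..."},
]

with open("portal_pairs.jsonl", "w") as f:
    for pair in pairs:
        record = {
            "instruction": "Respond to the patient's portal message with "
                           "accurate, empathetic guidance.",
            "input": pair["message"],
            "output": pair["response"],
        }
        f.write(json.dumps(record) + "\n")
```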
Collapse
Affiliation(s)
- Siru Liu
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37212, United States
| | - Allison B McCoy
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37212, United States
| | - Aileen P Wright
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37212, United States
- Department of Medicine, Vanderbilt University Medical Center, Nashville, TN 37212, United States
| | - Babatunde Carew
- Department of General Internal Medicine and Public Health, Vanderbilt University Medical Center, Nashville, TN 37212, United States
| | - Julian Z Genkins
- Department of Medicine, Stanford University, Stanford, CA 94304, United States
| | - Sean S Huang
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37212, United States
- Department of Medicine, Vanderbilt University Medical Center, Nashville, TN 37212, United States
| | - Josh F Peterson
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37212, United States
- Department of Medicine, Vanderbilt University Medical Center, Nashville, TN 37212, United States
| | - Bryan Steitz
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37212, United States
| | - Adam Wright
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37212, United States
| |
Collapse
|
20
|
Bridges JM. Computerized diagnostic decision support systems - a comparative performance study of Isabel Pro vs. ChatGPT4. Diagnosis (Berl) 2024; 0:dx-2024-0033. [PMID: 38709491 DOI: 10.1515/dx-2024-0033] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/15/2024] [Accepted: 04/22/2024] [Indexed: 05/07/2024]
Abstract
OBJECTIVES Validate the diagnostic accuracy of the artificial intelligence large language model ChatGPT4 by comparing diagnosis lists produced by ChatGPT4 to those of Isabel Pro. METHODS This study used 201 cases, comparing ChatGPT4 to Isabel Pro. Inputs to both systems were identical. Mean Reciprocal Rank (MRR) compares the rank of the correct diagnosis between systems. Isabel Pro ranks by the frequency with which the symptoms appear in the reference dataset; the mechanism ChatGPT4 uses to rank the diagnoses is unknown. A Wilcoxon signed-rank test was used to compare the systems. RESULTS Both systems produced comprehensive differential diagnosis lists. Isabel Pro's list appears immediately upon submission, while ChatGPT4 takes several minutes. Isabel Pro produced 175 (87.1%) correct diagnoses and ChatGPT4 165 (82.1%). The MRR for ChatGPT4 was 0.428 (rank 2.31) and for Isabel Pro 0.389 (rank 2.57), an average rank of approximately three for each. ChatGPT4 outperformed on Recall at Rank 1, 5, and 10, with Isabel Pro outperforming at 20, 30, and 40. The Wilcoxon signed-rank test indicated that the sample size was inadequate to conclude that the systems are equivalent. ChatGPT4 fabricated citations and DOIs, producing 145 correct references (87.9%) but only 52 correct DOIs (31.5%). CONCLUSIONS This study validates the promise of clinical diagnostic decision support systems, including the large language model form of artificial intelligence (AI). Until the issue of hallucination of references, and perhaps of diagnoses, is resolved in favor of absolute accuracy, clinicians will make cautious use of large language model systems in diagnosis, if at all.
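For readers unfamiliar with the two metrics: MRR is the mean of 1/rank of the correct diagnosis across cases, and Recall at Rank k is the fraction of cases whose correct diagnosis appears in the top k. A minimal sketch (the ranks below are invented for illustration):

```python
def mean_reciprocal_rank(ranks):
    # rank is None when the correct diagnosis never appeared in the list
    return sum(1.0 / r for r in ranks if r) / len(ranks)

def recall_at_k(ranks, k):
    return sum(1 for r in ranks if r and r <= k) / len(ranks)

ranks = [1, 3, None, 2, 1, 8]           # hypothetical per-case ranks
print(mean_reciprocal_rank(ranks))      # 0.493...
print(recall_at_k(ranks, 5))            # 0.666...
```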
Collapse
Affiliation(s)
- Joe M Bridges
- D. Bradley McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, USA
| |
Collapse
|
21
|
Makhoul M, Melkane AE, Khoury PE, Hadi CE, Matar N. A cross-sectional comparative study: ChatGPT 3.5 versus diverse levels of medical experts in the diagnosis of ENT diseases. Eur Arch Otorhinolaryngol 2024; 281:2717-2721. [PMID: 38365990 DOI: 10.1007/s00405-024-08509-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/01/2024] [Accepted: 01/24/2024] [Indexed: 02/18/2024]
Abstract
PURPOSE With recent advances in artificial intelligence (AI), it has become crucial to thoroughly evaluate its applicability in healthcare. This study aimed to assess the accuracy of ChatGPT in diagnosing ear, nose, and throat (ENT) pathology and to compare its performance to that of medical experts. METHODS We conducted a cross-sectional comparative study in which 32 ENT cases were presented to ChatGPT 3.5, ENT physicians, ENT residents, family medicine (FM) specialists, second-year medical students (Med2), and third-year medical students (Med3). Each participant provided three differential diagnoses. The study analyzed diagnostic accuracy rates and inter-rater agreement within and between participant groups and ChatGPT. RESULTS The accuracy rate of ChatGPT was 70.8%, not significantly different from that of ENT physicians or ENT residents. However, a significant difference in correctness rate existed between ChatGPT and FM specialists (49.8%, p < 0.001), and between ChatGPT and medical students (Med2 47.5%, p < 0.001; Med3 47%, p < 0.001). Inter-rater agreement for the differential diagnosis between ChatGPT and each participant group was either poor or fair. In 68.75% of cases, ChatGPT failed to mention the most critical diagnosis. CONCLUSIONS ChatGPT demonstrated accuracy comparable to that of ENT physicians and ENT residents in diagnosing ENT pathology, outperforming FM specialists, Med2, and Med3. However, it showed limitations in identifying the most critical diagnosis.
Collapse
Affiliation(s)
- Mikhael Makhoul
- Department of Otolaryngology-Head and Neck Surgery, Hotel Dieu de France Hospital, Saint Joseph University, Alfred Naccache Boulevard, Ashrafieh, PO Box: 166830, Beirut, Lebanon.
| | - Antoine E Melkane
- Department of Otolaryngology-Head and Neck Surgery, Hotel Dieu de France Hospital, Saint Joseph University, Alfred Naccache Boulevard, Ashrafieh, PO Box: 166830, Beirut, Lebanon
| | - Patrick El Khoury
- Department of Otolaryngology-Head and Neck Surgery, Hotel Dieu de France Hospital, Saint Joseph University, Alfred Naccache Boulevard, Ashrafieh, PO Box: 166830, Beirut, Lebanon
| | - Christopher El Hadi
- Department of Otolaryngology-Head and Neck Surgery, Hotel Dieu de France Hospital, Saint Joseph University, Alfred Naccache Boulevard, Ashrafieh, PO Box: 166830, Beirut, Lebanon
| | - Nayla Matar
- Department of Otolaryngology-Head and Neck Surgery, Hotel Dieu de France Hospital, Saint Joseph University, Alfred Naccache Boulevard, Ashrafieh, PO Box: 166830, Beirut, Lebanon
| |
Collapse
|
22
|
Temsah MH, Jamal A, Alhasan K, Aljamaan F, Altamimi I, Malki KH, Temsah A, Ohannessian R, Al-Eyadhy A. Transforming Virtual Healthcare: The Potentials of ChatGPT-4omni in Telemedicine. Cureus 2024; 16:e61377. [PMID: 38817799 PMCID: PMC11139454 DOI: 10.7759/cureus.61377] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 05/30/2024] [Indexed: 06/01/2024] Open
Abstract
The introduction of OpenAI's ChatGPT-4omni (GPT-4o) represents a potential advancement in virtual healthcare and telemedicine. GPT-4o excels in processing audio, visual, and textual data in real time, offering possible enhancements in understanding natural language in both English and non-English contexts. Furthermore, the new "Temporary Chat" feature may improve privacy and data confidentiality during interactions, potentially increasing integration with healthcare systems. These innovations promise to enhance communication clarity, facilitate the integration of medical images, and increase data privacy in online consultations. This editorial explores some future implications of these advancements for telemedicine, highlighting the necessity for further research on reliability and the integration of advanced language models with human expertise.
Collapse
Affiliation(s)
- Mohamad-Hani Temsah
- Pediatric Intensive Care Unit, Pediatric Department, King Saud University Medical City, College of Medicine, King Saud University, Riyadh, SAU
| | - Amr Jamal
- Family and Community Medicine, King Saud University, Riyadh, SAU
| | | | - Fadi Aljamaan
- Critical Care Department, College of Medicine, King Saud University, Riyadh, SAU
| | | | - Khalid H Malki
- Department of Otolaryngology, College of Medicine, King Saud University, Riyadh, SAU
| | - Abdulrahman Temsah
- Software Engineering Department, College of Engineering, Alfaisal University, Riyadh, SAU
| | | | - Ayman Al-Eyadhy
- Department of Pediatrics, Pediatric Intensive Care Unit, College of Medicine, King Saud University, Riyadh, SAU
- Pediatric Intensive Care Unit, King Saud University Medical City, Riyadh, SAU
| |
Collapse
|
23
|
Yilmaz Muluk S. Enhancing Musculoskeletal Injection Safety: Evaluating Checklists Generated by Artificial Intelligence and Revising the Preformed Checklist. Cureus 2024; 16:e59708. [PMID: 38841023 PMCID: PMC11150897 DOI: 10.7759/cureus.59708] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 05/02/2024] [Indexed: 06/07/2024] Open
Abstract
Background Musculoskeletal disorders are a significant global health issue, necessitating advanced management strategies such as intra-articular and extra-articular injections to alleviate pain and inflammation and to address mobility challenges. As the adoption of these interventions by physicians grows, the importance of robust safety protocols becomes paramount. This study evaluates the effectiveness of conversational artificial intelligence (AI), particularly versions 3.5 and 4 of Chat Generative Pre-trained Transformer (ChatGPT), in creating patient safety checklists for musculoskeletal injections, with the goal of enhancing the preparation of safety documentation. Methodology A quantitative analysis was conducted to evaluate AI-generated safety checklists against a preformed checklist adapted from reputable medical sources. Adherence of the generated checklists to the preformed checklist was calculated and classified. The Wilcoxon signed-rank test was used to assess performance differences between ChatGPT versions 3.5 and 4. Results ChatGPT-4 showed superior adherence to the preformed checklist compared to ChatGPT-3.5, with both versions classified as very good in safety protocol creation. Although no significant differences were present between the two versions in the sign-in and sign-out parts of the checklists, ChatGPT-4 had significantly higher scores in the procedure planning part (p = 0.007), and its overall performance was also higher (p < 0.001). Subsequently, the preformed checklist was revised to incorporate new contributions from ChatGPT. Conclusions ChatGPT, especially version 4, proved effective in generating patient safety checklists for musculoskeletal injections, highlighting the potential of AI to streamline clinical practices. Further enhancements are necessary to fully meet medical standards.
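As a hedged illustration of the paired analysis named above (the per-item adherence scores below are invented; scipy's implementation drops zero differences by default):

```python
# Wilcoxon signed-rank test on paired checklist-item scores for the
# two ChatGPT versions; a sketch, not the study's data or code.
from scipy.stats import wilcoxon

gpt35 = [3, 4, 2, 5, 3, 4, 3, 2, 4, 3]
gpt4  = [4, 4, 3, 5, 4, 5, 4, 3, 5, 4]
stat, p = wilcoxon(gpt35, gpt4)
print(f"W = {stat}, p = {p:.4f}")
```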
Collapse
|
24
|
Miao J, Thongprayoon C, Fülöp T, Cheungpasitporn W. Enhancing clinical decision-making: Optimizing ChatGPT's performance in hypertension care. J Clin Hypertens (Greenwich) 2024; 26:588-593. [PMID: 38646920 PMCID: PMC11088425 DOI: 10.1111/jch.14822] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/17/2024] [Revised: 03/27/2024] [Accepted: 03/28/2024] [Indexed: 04/23/2024]
Affiliation(s)
- Jing Miao
- Division of NephrologyDepartment of Medicine, Mayo ClinicRochesterMinnesotaUSA
| | - Charat Thongprayoon
- Division of NephrologyDepartment of Medicine, Mayo ClinicRochesterMinnesotaUSA
| | - Tibor Fülöp
- Division of NephrologyDepartment of Medicine, Medical University of South CarolinaCharlestonSouth CarolinaUSA
- Medicine ServiceRalph H. Johnson VA Medical CenterCharlestonSouth CarolinaUSA
| | | |
Collapse
|
25
|
Ferdush J, Begum M, Hossain ST. ChatGPT and Clinical Decision Support: Scope, Application, and Limitations. Ann Biomed Eng 2024; 52:1119-1124. [PMID: 37516680 DOI: 10.1007/s10439-023-03329-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/15/2023] [Accepted: 07/18/2023] [Indexed: 07/31/2023]
Abstract
This study examines ChatGPT's role in clinical decision support by analyzing its scope, application, and limitations. By analyzing patient data and providing evidence-based recommendations, ChatGPT, an AI language model, can help healthcare professionals make well-informed decisions. Applications examined include diagnosis and treatment planning. However, the study acknowledges limitations such as bias, lack of contextual understanding, and the need for human oversight, and it proposes a framework for a future clinical decision support system. Understanding these factors will allow healthcare professionals to utilize ChatGPT effectively and make accurate clinical decisions. Further research is needed to understand the implications of using ChatGPT in healthcare settings and to develop safeguards for responsible use.
Collapse
Affiliation(s)
- Jannatul Ferdush
- Department of Computer Science and Engineering, Jashore University of Science and Technology, Jashore, 7408, Bangladesh.
| | - Mahbuba Begum
- Department of Computer Science and Engineering, Mawlana Bhasani Science and Technology, Tangail, 1902, Bangladesh
| | - Sakib Tanvir Hossain
- Department of Mechanical Engineering, Khulna University of Engineering and Technology, Khulna, 9203, Bangladesh
| |
Collapse
|
26
|
Safrai M, Azaria A. Does small talk with a medical provider affect ChatGPT's medical counsel? Performance of ChatGPT on USMLE with and without distractions. PLoS One 2024; 19:e0302217. [PMID: 38687696 PMCID: PMC11060598 DOI: 10.1371/journal.pone.0302217] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/18/2023] [Accepted: 03/28/2024] [Indexed: 05/02/2024] Open
Abstract
Efforts are being made to improve the time effectiveness of healthcare providers. Artificial intelligence tools can help transcribe and summarize physician-patient encounters and produce medical notes and medical recommendations. However, in addition to medical information, discussion between healthcare providers and patients includes small talk and other information irrelevant to medical concerns. As Large Language Models (LLMs) are predictive models that build their responses from the words in the prompt, there is a risk that small talk and irrelevant information may alter the response and the suggestion given. Therefore, this study aims to investigate the impact of medical data mixed with small talk on the accuracy of medical advice provided by ChatGPT. USMLE Step 3 questions were used as a model for relevant medical data, in both multiple-choice and open-ended form. First, we gathered small talk sentences from human participants using the Mechanical Turk platform. Second, both sets of USMLE questions were arranged in a pattern where each sentence from the original question was followed by a small talk sentence. ChatGPT 3.5 and 4 were asked to answer both sets of questions with and without the small talk sentences. Finally, a board-certified physician analyzed the answers by ChatGPT and compared them to the formal correct answer. The results demonstrate that ChatGPT-3.5's ability to answer correctly was impaired when small talk was added to medical data (66.8% vs. 56.6%; p = 0.025); the impairment was not significant for multiple-choice questions (72.1% vs. 68.9%; p = 0.67) but was significant for the open-ended questions (61.5% vs. 44.3%; p = 0.01). In contrast, small talk phrases did not impair ChatGPT-4's ability on either question type (83.6% and 66.2%, respectively). According to these results, ChatGPT-4 seems more accurate than the earlier 3.5 version, and small talk does not appear to impair its capability to provide medical recommendations. Our results are an important first step in understanding the potential and limitations of utilizing ChatGPT and other LLMs for physician-patient interactions, which include casual conversations.
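The question-construction step described above is easy to picture in code; a minimal sketch (naive sentence splitting; the study's exact procedure and sentences may differ):

```python
# Follow each sentence of a medical question with one small-talk sentence.
import itertools

def interleave(question: str, small_talk: list[str]) -> str:
    sentences = [s.strip() + "." for s in question.split(".") if s.strip()]
    talk = itertools.cycle(small_talk)
    return " ".join(s + " " + next(talk) for s in sentences)

question = ("A 45-year-old man presents with chest pain. "
            "He smokes one pack of cigarettes per day.")
chatter = ["By the way, my daughter just started college.",
           "It has been raining all week, hasn't it?"]
print(interleave(question, chatter))
```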
Collapse
Affiliation(s)
- Myriam Safrai
- Department of Obstetrics and Gynecology, Chaim Sheba Medical Center (Tel Hashomer), Sackler Faculty of Medicine, Tel Aviv University, Tel Aviv, Israel
- Department of Obstetrics, Gynecology and Reproductive Sciences, Magee-Womens Research Institute, University of Pittsburgh School of Medicine, Pittsburgh, PA, United States of America
| | - Amos Azaria
- School of Computer Science, Ariel University, Ari’el, Israel
| |
Collapse
|
27
|
Wang L, Wan Z, Ni C, Song Q, Li Y, Clayton EW, Malin BA, Yin Z. A Systematic Review of ChatGPT and Other Conversational Large Language Models in Healthcare. MEDRXIV : THE PREPRINT SERVER FOR HEALTH SCIENCES 2024:2024.04.26.24306390. [PMID: 38712148 PMCID: PMC11071576 DOI: 10.1101/2024.04.26.24306390] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/08/2024]
Abstract
Background The launch of the Chat Generative Pre-trained Transformer (ChatGPT) in November 2022 has attracted public attention and academic interest to large language models (LLMs), facilitating the emergence of many other innovative LLMs. These LLMs have been applied in various fields, including healthcare. Numerous studies have since been conducted regarding how to employ state-of-the-art LLMs in health-related scenarios to assist patients, doctors, and public health administrators. Objective This review aims to summarize the applications and concerns of applying conversational LLMs in healthcare and to provide an agenda for future research on LLMs in healthcare. Methods We utilized PubMed, ACM, and IEEE digital libraries as primary sources for this review. We followed the guidance of Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) to screen and select peer-reviewed research articles that (1) were related to both healthcare applications and conversational LLMs and (2) were published before September 1st, 2023, the date when we started paper collection and screening. We investigated these papers and classified them according to their applications and concerns. Results Our search initially identified 820 papers according to targeted keywords, out of which 65 met our criteria and were included in the review. The most popular conversational LLM was ChatGPT from OpenAI (60), followed by Bard from Google (1), Large Language Model Meta AI (LLaMA) from Meta (1), and other LLMs (5). These papers were classified into four categories of applications: 1) summarization, 2) medical knowledge inquiry, 3) prediction, and 4) administration, and four categories of concerns: 1) reliability, 2) bias, 3) privacy, and 4) public acceptability. Forty-nine (75%) papers used LLMs for summarization and/or medical knowledge inquiry, and 58 (89%) expressed concerns about reliability and/or bias. We found that conversational LLMs exhibit promising results in summarization and in providing medical knowledge to patients with relatively high accuracy. However, conversational LLMs like ChatGPT are not able to provide reliable answers to complex health-related tasks that require specialized domain expertise. Additionally, none of the reviewed papers conducted experiments to thoughtfully examine how conversational LLMs lead to bias or privacy issues in healthcare research. Conclusions Future studies should focus on improving the reliability of LLM applications in complex health-related tasks, as well as on investigating the mechanisms by which LLM applications introduce bias and privacy issues. Considering the broad accessibility of LLMs, legal, social, and technical efforts are all needed to address these concerns and to promote, improve, and regulate the application of LLMs in healthcare.
Collapse
Affiliation(s)
- Leyao Wang
- Department of Computer Science, Vanderbilt University, Nashville, TN, USA, 37212
| | - Zhiyu Wan
- Department of Biomedical Informatics, Vanderbilt University Medical Center, TN, USA, 37203
| | - Congning Ni
- Department of Computer Science, Vanderbilt University, Nashville, TN, USA, 37212
| | - Qingyuan Song
- Department of Computer Science, Vanderbilt University, Nashville, TN, USA, 37212
| | - Yang Li
- Department of Computer Science, Vanderbilt University, Nashville, TN, USA, 37212
| | - Ellen Wright Clayton
- Department of Pediatrics, Vanderbilt University Medical Center, Nashville, Tennessee, USA, 37203
- Center for Biomedical Ethics and Society, Vanderbilt University Medical Center, Nashville, Tennessee, USA, 37203
| | - Bradley A. Malin
- Department of Computer Science, Vanderbilt University, Nashville, TN, USA, 37212
- Department of Biomedical Informatics, Vanderbilt University Medical Center, TN, USA, 37203
- Department of Biostatistics, Vanderbilt University Medical Center, TN, USA, 37203
| | - Zhijun Yin
- Department of Computer Science, Vanderbilt University, Nashville, TN, USA, 37212
- Department of Biomedical Informatics, Vanderbilt University Medical Center, TN, USA, 37203
| |
Collapse
|
28
|
Quttainah M, Mishra V, Madakam S, Lurie Y, Mark S. Cost, Usability, Credibility, Fairness, Accountability, Transparency, and Explainability Framework for Safe and Effective Large Language Models in Medical Education: Narrative Review and Qualitative Study. JMIR AI 2024; 3:e51834. [PMID: 38875562 PMCID: PMC11077408 DOI: 10.2196/51834] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 08/16/2023] [Revised: 12/20/2023] [Accepted: 02/03/2024] [Indexed: 06/16/2024]
Abstract
BACKGROUND The world has witnessed increased adoption of large language models (LLMs) in the last year. Although products developed using LLMs have the potential to solve accessibility and efficiency problems in health care, there is a lack of available guidelines for developing LLMs for health care, especially for medical education. OBJECTIVE The aim of this study was to identify and prioritize the enablers for developing successful LLMs for medical education, and to evaluate the relationships among these identified enablers. METHODS A narrative review of the extant literature was first performed to identify the key enablers for LLM development. We additionally gathered the opinions of LLM users to determine the relative importance of these enablers using an analytical hierarchy process (AHP), a multicriteria decision-making method. Further, total interpretive structural modeling (TISM) was used to analyze the perspectives of product developers and ascertain the relationships and hierarchy among these enablers. Finally, the cross-impact matrix-based multiplication applied to a classification (MICMAC) approach was used to determine the relative driving and dependence powers of these enablers. A nonprobabilistic purposive sampling approach was used for recruitment of focus groups. RESULTS The AHP demonstrated that the most important enabler for LLMs was credibility, with a priority weight of 0.37, followed by accountability (0.27642) and fairness (0.10572). In contrast, usability, with a priority weight of 0.04, showed negligible importance. The results of TISM concurred with the findings of the AHP. The only striking difference between expert perspectives and user preference evaluation was that the product developers indicated that cost has the least importance as a potential enabler, whereas the MICMAC analysis suggested that cost has a strong influence on the other enablers. The inputs of the focus group were found to be reliable, with a consistency ratio less than 0.1 (0.084). CONCLUSIONS This study is the first to identify, prioritize, and analyze the relationships among enablers of effective LLMs for medical education. Based on the results, we developed a comprehensible prescriptive framework, named CUC-FATE (Cost, Usability, Credibility, Fairness, Accountability, Transparency, and Explainability), for evaluating the enablers of LLMs in medical education. The findings are useful for health care professionals, health technology experts, medical technology regulators, and policy makers.
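For context on the AHP step: priority weights are the normalized principal eigenvector of a reciprocal pairwise-comparison matrix, and the consistency ratio CR = ((λmax − n)/(n − 1))/RI should stay below 0.1, as the study's 0.084 does. A minimal sketch with an invented 3 × 3 matrix (the study compared seven enablers):

```python
# AHP priority weights and consistency ratio from a pairwise matrix.
import numpy as np

A = np.array([[1,   3,   5],
              [1/3, 1,   2],
              [1/5, 1/2, 1]])       # invented judgments, not study data
n = A.shape[0]
eigvals, eigvecs = np.linalg.eig(A)
k = np.argmax(eigvals.real)
weights = np.abs(eigvecs[:, k].real)
weights /= weights.sum()            # normalized priority weights

RI = {3: 0.58, 4: 0.90, 5: 1.12, 6: 1.24, 7: 1.32}[n]  # Saaty's random index
CR = ((eigvals[k].real - n) / (n - 1)) / RI
print(weights.round(3), f"CR = {CR:.3f}")
```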
Collapse
Affiliation(s)
- Majdi Quttainah
- College of Business Administration, Kuwait University, Kuwait, Kuwait
| | - Vinaytosh Mishra
- College of Healthcare Management and Economics, Gulf Medical University, Ajman, United Arab Emirates
| | - Somayya Madakam
- Information Technology, Birla Institute of Management Technology, Knowledge Park - II, Greater Noida, India
| | - Yotam Lurie
- Department of Management, Ben-Gurion University, Negev, Israel
| | - Shlomo Mark
- Department of Software Engineering, Shamoon College of Engineering, Ashdod, Israel
| |
Collapse
|
29
|
Rosselló-Jiménez D, Docampo S, Collado Y, Cuadra-Llopart L, Riba F, Llonch-Masriera M. Geriatrics and artificial intelligence in Spain (Ger-IA project): talking to ChatGPT, a nationwide survey. Eur Geriatr Med 2024:10.1007/s41999-024-00970-7. [PMID: 38615289 DOI: 10.1007/s41999-024-00970-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/03/2023] [Accepted: 03/04/2024] [Indexed: 04/15/2024]
Abstract
PURPOSE The purposes of the study were to describe the degree of agreement between geriatricians and the answers given by an AI tool (ChatGPT) to questions covering different areas of geriatrics, to study the differences between specialists and residents in geriatrics in terms of their degree of agreement with ChatGPT, and to analyse the mean scores obtained by area of knowledge/domain. METHODS An observational study was conducted involving 126 doctors from 41 geriatric medicine departments in Spain. Ten questions about geriatric medicine were posed to ChatGPT, and doctors evaluated the AI's answers using a Likert scale. Sociodemographic variables were included. Questions were categorized into five knowledge domains, and means and standard deviations were calculated for each. RESULTS A total of 130 doctors answered the questionnaire; 126 (69.8% women; mean age 41.4 [9.8]) were included in the final analysis. The mean score obtained by ChatGPT was 3.1/5 [0.67]. Specialists rated ChatGPT lower than residents (3.0/5 vs. 3.3/5 points, respectively, P < 0.05). By domain, ChatGPT scored better on general/theoretical questions (M: 3.96; SD: 0.71) than on complex decisions/end-of-life situations (M: 2.50; SD: 0.76), and answers related to diagnosis/performance of complementary tests obtained the lowest scores (M: 2.48; SD: 0.77). CONCLUSION Scores varied considerably by area of knowledge. Questions related to theoretical aspects of challenges and the future of geriatrics obtained better scores. For complex decision-making, the appropriateness of therapeutic effort, or decisions about diagnostic tests, professionals indicated poorer performance. AI is likely to be incorporated into some areas of medicine, but it still presents important limitations, mainly in complex medical decision-making.
Collapse
Affiliation(s)
- Daniel Rosselló-Jiménez
- Geriatric Medicine Department, Hospital Universitari de Terrassa, Consorci Sanitari de Terrassa, Carr. Torrebonica, s/n, Terrassa, 08227, Barcelona, Spain.
| | - S Docampo
- Geriatric Medicine Department, Hospital Santa Creu, Tortosa, Tortosa, Tarragona, Spain
| | - Y Collado
- Geriatric Medicine Department, Hospital Universitari de Terrassa, Consorci Sanitari de Terrassa, Carr. Torrebonica, s/n, Terrassa, 08227, Barcelona, Spain
| | - L Cuadra-Llopart
- Geriatric Medicine Department, Hospital Universitari de Terrassa, Consorci Sanitari de Terrassa, Carr. Torrebonica, s/n, Terrassa, 08227, Barcelona, Spain
- Faculty of Medicine and Health Sciences, Universitat Internacional de Catalunya (UIC), Barcelona, Spain
- ACTIUM Functional Anatomy Group, Universitat Internacional de Catalunya (UIC), Barcelona, Spain
| | - F Riba
- Geriatric Medicine Department, Hospital Santa Creu, Tortosa, Tortosa, Tarragona, Spain
| | - M Llonch-Masriera
- Geriatric Medicine Department, Hospital Universitari de Terrassa, Consorci Sanitari de Terrassa, Carr. Torrebonica, s/n, Terrassa, 08227, Barcelona, Spain
- Faculty of Medicine and Health Sciences, Universitat Internacional de Catalunya (UIC), Barcelona, Spain
| |
Collapse
|
30
|
Hirosawa T, Harada Y, Tokumasu K, Ito T, Suzuki T, Shimizu T. Evaluating ChatGPT-4's Diagnostic Accuracy: Impact of Visual Data Integration. JMIR Med Inform 2024; 12:e55627. [PMID: 38592758 DOI: 10.2196/55627] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/18/2023] [Revised: 02/14/2024] [Accepted: 03/13/2024] [Indexed: 04/10/2024] Open
Abstract
BACKGROUND In the evolving field of health care, multimodal generative artificial intelligence (AI) systems, such as ChatGPT-4 with vision (ChatGPT-4V), represent a significant advancement, as they integrate visual data with text data. This integration has the potential to revolutionize clinical diagnostics by offering more comprehensive analysis capabilities. However, the impact on diagnostic accuracy of using image data to augment ChatGPT-4 remains unclear. OBJECTIVE This study aims to assess the impact of adding image data on ChatGPT-4's diagnostic accuracy and provide insights into how image data integration can enhance the accuracy of multimodal AI in medical diagnostics. Specifically, this study endeavored to compare the diagnostic accuracy between ChatGPT-4V, which processed both text and image data, and its counterpart, ChatGPT-4, which only uses text data. METHODS We identified a total of 557 case reports published in the American Journal of Case Reports from January 2022 to March 2023. After excluding cases that were nondiagnostic, pediatric, and lacking image data, we included 363 case descriptions with their final diagnoses and associated images. We compared the diagnostic accuracy of ChatGPT-4V and ChatGPT-4 without vision based on their ability to include the final diagnoses within differential diagnosis lists. Two independent physicians evaluated their accuracy, with a third resolving any discrepancies, ensuring a rigorous and objective analysis. RESULTS The integration of image data into ChatGPT-4V did not significantly enhance diagnostic accuracy, showing that final diagnoses were included in the top 10 differential diagnosis lists at a rate of 85.1% (n=309), comparable to the rate of 87.9% (n=319) for the text-only version (P=.33). Notably, ChatGPT-4V's performance in correctly identifying the top diagnosis was inferior, at 44.4% (n=161), compared with 55.9% (n=203) for the text-only version (P=.002, χ2 test). Additionally, ChatGPT-4's self-reports showed that image data accounted for 30% of the weight in developing the differential diagnosis lists in more than half of cases. CONCLUSIONS Our findings reveal that currently, ChatGPT-4V predominantly relies on textual data, limiting its ability to fully use the diagnostic potential of visual information. This study underscores the need for further development of multimodal generative AI systems to effectively integrate and use clinical image data. Enhancing the diagnostic performance of such AI systems through improved multimodal data integration could significantly benefit patient care by providing more accurate and comprehensive diagnostic insights. Future research should focus on overcoming these limitations, paving the way for the practical application of advanced AI in medicine.
Collapse
Affiliation(s)
- Takanobu Hirosawa
- Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Shimotsuga, Japan
| | - Yukinori Harada
- Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Shimotsuga, Japan
| | - Kazuki Tokumasu
- Department of General Medicine, Okayama University Graduate School of Medicine, Dentistry and Pharmaceutical Sciences, Okayama, Japan
| | | | - Tomoharu Suzuki
- Department of Hospital Medicine, Urasoe General Hospital, Okinawa, Japan
| | - Taro Shimizu
- Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Shimotsuga, Japan
| |
Collapse
|
31
|
Pavlovic ZJ, Jiang VS, Hariton E. Current applications of artificial intelligence in assisted reproductive technologies through the perspective of a patient's journey. Curr Opin Obstet Gynecol 2024:00001703-990000000-00122. [PMID: 38597425 DOI: 10.1097/gco.0000000000000951] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 04/11/2024]
Abstract
PURPOSE OF REVIEW This review highlights the timely relevance of artificial intelligence in enhancing assisted reproductive technologies (ARTs), particularly in-vitro fertilization (IVF). It underscores artificial intelligence's potential to revolutionize patient outcomes and operational efficiency by addressing challenges in fertility diagnoses and procedures. RECENT FINDINGS Recent advancements in artificial intelligence, including machine learning and predictive modeling, are making significant strides in optimizing IVF processes such as medication dosing, scheduling, and embryological assessments. Innovations include artificial intelligence-augmented diagnostic testing, predictive modeling for treatment outcomes, scheduling optimization, dosing and protocol selection, follicular and hormone monitoring, trigger timing, and improved embryo selection. These developments promise to refine treatment approaches, enhance patient engagement, and increase the accuracy and scalability of fertility treatments. SUMMARY The integration of artificial intelligence into reproductive medicine offers profound implications for clinical practice and research. By facilitating personalized treatment plans, standardizing procedures, and improving the efficiency of fertility clinics, artificial intelligence technologies pave the way for value-based, accessible, and efficient fertility services. Despite the promise, realizing the full potential of artificial intelligence in ART will require ongoing validation and ethical consideration to ensure equitable and effective implementation.
Collapse
Affiliation(s)
- Zoran J Pavlovic
- Department of Obstetrics and Gynecology/Reproductive Endocrinology and Infertility, University of South Florida, Morsani College of Medicine, Tampa, Florida
| | - Victoria S Jiang
- Division of Reproductive Endocrinology & Infertility, Vincent Department of Obstetrics and Gynecology, Massachusetts General Hospital/Harvard Medical School, Boston, Massachusetts
| | - Eduardo Hariton
- Reproductive Science Center of the San Francisco Bay Area, San Ramon, California, USA
| |
Collapse
|
32
|
Warrier A, Singh R, Haleem A, Zaki H, Eloy JA. The Comparative Diagnostic Capability of Large Language Models in Otolaryngology. Laryngoscope 2024. [PMID: 38563415 DOI: 10.1002/lary.31434] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/19/2024] [Revised: 03/05/2024] [Accepted: 03/21/2024] [Indexed: 04/04/2024]
Abstract
OBJECTIVES Evaluate and compare the ability of large language models (LLMs) to diagnose various ailments in otolaryngology. METHODS We collected all 100 clinical vignettes from the second edition of Otolaryngology Cases-The University of Cincinnati Clinical Portfolio by Pensak et al. Prefacing each vignette with the prompt "Provide a diagnosis given the following history," we asked ChatGPT-3.5, Google Bard, and Bing-GPT4 to provide a diagnosis for each vignette. These diagnoses were compared to the portfolio for accuracy and recorded. All queries were run in June 2023. RESULTS ChatGPT-3.5 was the most accurate model (89% success rate), followed by Google Bard (82%) and Bing-GPT4 (74%). A chi-squared test revealed a significant difference among the three LLMs in providing correct diagnoses (p = 0.023). Of the 100 vignettes, seven require additional testing results (i.e., biopsy, non-contrast CT) for accurate clinical diagnosis. When omitting these vignettes, the revised success rates were 95.7% for ChatGPT-3.5, 88.17% for Google Bard, and 78.72% for Bing-GPT4 (p = 0.002). CONCLUSIONS ChatGPT-3.5 offers the most accurate diagnoses when given established clinical vignettes, as compared to Google Bard and Bing-GPT4. LLMs may accurately offer assessments for common otolaryngology conditions but currently require detailed prompt information and critical supervision from clinicians. There is vast potential in the clinical applicability of LLMs; however, practitioners should be wary of possible "hallucinations" and misinformation in responses. LEVEL OF EVIDENCE 3 Laryngoscope, 2024.
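The reported p = 0.023 can be reproduced from the stated success rates (89, 82, and 74 correct out of 100 each); a minimal sketch, not the authors' code:

```python
from scipy.stats import chi2_contingency

table = [[89, 11],   # ChatGPT-3.5: correct / incorrect
         [82, 18],   # Google Bard
         [74, 26]]   # Bing-GPT4
chi2, p, dof, _ = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.3f}")   # p ≈ 0.023
```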
Collapse
Affiliation(s)
- Akshay Warrier
- Department of Otolaryngology-Head and Neck Surgery, Rutgers New Jersey Medical School, Newark, New Jersey, U.S.A
| | - Rohan Singh
- Department of Otolaryngology-Head and Neck Surgery, Rutgers New Jersey Medical School, Newark, New Jersey, U.S.A
| | - Afash Haleem
- Department of Otolaryngology-Head and Neck Surgery, Rutgers New Jersey Medical School, Newark, New Jersey, U.S.A
| | - Haider Zaki
- Department of Otolaryngology-Head and Neck Surgery, Rutgers New Jersey Medical School, Newark, New Jersey, U.S.A
| | - Jean Anderson Eloy
- Department of Otolaryngology-Head and Neck Surgery, Rutgers New Jersey Medical School, Newark, New Jersey, U.S.A
- Center for Skull Base and Pituitary Surgery, Neurological Institute of New Jersey, Rutgers New Jersey Medical School, Newark, New Jersey, U.S.A
| |
Collapse
|
33
|
Teixeira-Marques F, Medeiros N, Nazaré F, Alves S, Lima N, Ribeiro L, Gama R, Oliveira P. Exploring the role of ChatGPT in clinical decision-making in otorhinolaryngology: a ChatGPT designed study. Eur Arch Otorhinolaryngol 2024; 281:2023-2030. [PMID: 38345613 DOI: 10.1007/s00405-024-08498-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/04/2023] [Accepted: 01/23/2024] [Indexed: 03/16/2024]
Abstract
PURPOSE Since the beginning of 2023, ChatGPT has emerged as a hot topic in healthcare research. Its potential as a valuable tool in clinical practice is compelling, particularly for improving clinical decision support by helping physicians make decisions based on the best medical knowledge available. We aimed to investigate ChatGPT's ability to identify, diagnose, and manage patients with otorhinolaryngology-related symptoms. METHODS A prospective, cross-sectional study was designed, based on an idea suggested by ChatGPT, to assess the level of agreement between ChatGPT and five otorhinolaryngologists (ENTs) on 20 reality-inspired clinical cases. The clinical cases were presented to the chatbot on two different occasions (ChatGPT-1 and ChatGPT-2) to assess its temporal stability. RESULTS The mean score of ChatGPT-1 was 4.4 (SD 1.2; min 1, max 5) and of ChatGPT-2 was 4.15 (SD 1.3; min 1, max 5), while the ENTs' mean score was 4.91 (SD 0.3; min 3, max 5). The Mann-Whitney U test revealed a statistically significant difference (p < 0.001) between each ChatGPT score and the ENTs' score. ChatGPT-1 and ChatGPT-2 gave different answers on five occasions. CONCLUSIONS Artificial intelligence will be an important instrument in clinical decision-making in the near future, and ChatGPT is the most promising chatbot so far. Although further development is needed before it can be used safely, there is room for improvement and potential to aid otorhinolaryngology residents and specialists in making the most appropriate decision for the patient.
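As a hedged sketch of the unpaired comparison used above (the per-case Likert scores below are invented; the study's data are not public):

```python
# Mann-Whitney U test comparing ChatGPT's case scores with the ENTs'.
from scipy.stats import mannwhitneyu

chatgpt_scores = [5, 4, 5, 3, 5, 5, 4, 2, 5, 5]
ent_scores     = [5, 5, 5, 5, 5, 4, 5, 5, 5, 5]
U, p = mannwhitneyu(chatgpt_scores, ent_scores, alternative="two-sided")
print(f"U = {U}, p = {p:.3f}")
```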
Collapse
Affiliation(s)
- Francisco Teixeira-Marques
- Department of Otorhinolaryngology, Centro Hospitalar de Vila Nova de Gaia/Espinho, Gaia (Porto), Portugal.
| | - Nuno Medeiros
- Department of Otorhinolaryngology, Centro Hospitalar de Vila Nova de Gaia/Espinho, Gaia (Porto), Portugal
| | - Francisco Nazaré
- Department of Otorhinolaryngology, Centro Hospitalar de Vila Nova de Gaia/Espinho, Gaia (Porto), Portugal
| | - Sandra Alves
- Department of Otorhinolaryngology, Centro Hospitalar de Vila Nova de Gaia/Espinho, Gaia (Porto), Portugal
| | - Nuno Lima
- Department of Otorhinolaryngology, Centro Hospitalar de Vila Nova de Gaia/Espinho, Gaia (Porto), Portugal
| | - Leandro Ribeiro
- Department of Otorhinolaryngology, Centro Hospitalar de Vila Nova de Gaia/Espinho, Gaia (Porto), Portugal
| | - Rita Gama
- Department of Otorhinolaryngology, Centro Hospitalar de Vila Nova de Gaia/Espinho, Gaia (Porto), Portugal
| | - Pedro Oliveira
- Department of Otorhinolaryngology, Centro Hospitalar de Vila Nova de Gaia/Espinho, Gaia (Porto), Portugal
| |
Collapse
|
34
|
Li Z, Shu Y, Tang Y, Lai W. A commentary on 'Auxiliary use of ChatGPT in surgical diagnosis and treatment' (Int J Surg 109 (2023) 3940-3943). Int J Surg 2024; 110:2492-2493. [PMID: 38241341 PMCID: PMC11020027 DOI: 10.1097/js9.0000000000001126] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/06/2024] [Accepted: 01/08/2024] [Indexed: 01/21/2024]
Affiliation(s)
- Ziwei Li
- Chongqing University FuLing Hospital, Chongqing, People's Republic of China
| | | | | | | |
Collapse
|
35
|
Javid M, Bhandari M, Parameshwari P, Reddiboina M, Prasad S. Evaluation of ChatGPT for Patient Counseling in Kidney Stone Clinic: A Prospective Study. J Endourol 2024; 38:377-383. [PMID: 38411835 DOI: 10.1089/end.2023.0571] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/28/2024] Open
Abstract
Introduction: Large language models (LLMs) have the potential to improve clinical workflow and make patient care more efficient. We prospectively evaluated the performance of the LLM ChatGPT as a patient counseling tool in the urology stone clinic and validated the generated responses against those of urologists. Methods: We collected 61 questions from 12 kidney stone patients and posed them to ChatGPT and a panel of experienced urologists (Level 1). Subsequently, the blinded responses of the urologists and ChatGPT were presented to two expert urologists (Level 2) for comparative evaluation on preset domains: accuracy, relevance, empathy, completeness, and practicality. All responses were rated on a Likert scale of 1 to 10. The mean differences in the scores given by the Level 2 urologists were analyzed, and inter-rater reliability (IRR), the level of agreement between the Level 2 urologists, was assessed with Cohen's kappa. Results: The mean differences in average scores between the responses from ChatGPT and the urologists were significant for accuracy (p < 0.001), empathy (p < 0.001), completeness (p < 0.001), and practicality (p < 0.001), but not for the relevance domain (p = 0.051), with ChatGPT's responses rated higher. The IRR analysis revealed significant agreement only in the empathy domain (k = 0.163; 0.059-0.266). Conclusion: We believe the introduction of ChatGPT into the clinical workflow could further optimize the information provided to patients in a busy stone clinic. In this preliminary study, ChatGPT supplemented the answers provided by the urologists, adding value to the conversation. However, in its current state, it is not yet ready to be a direct source of authentic information for patients. We recommend its use as a source for building a comprehensive Frequently Asked Questions bank as a prelude to developing an LLM chatbot for patient counseling.
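For readers who want to see the agreement statistic in action, a minimal sketch of Cohen's kappa between two raters (labels invented and binarized purely for illustration; the study rated on a 1 to 10 scale):

```python
from sklearn.metrics import cohen_kappa_score

rater_a = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]   # hypothetical agree/disagree labels
rater_b = [1, 0, 0, 1, 1, 1, 1, 0, 0, 1]
print(f"kappa = {cohen_kappa_score(rater_a, rater_b):.3f}")
```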
Collapse
Affiliation(s)
- Mohamed Javid
- Department of Urology, Chengalpattu Medical College, Chengalpattu, Tamil Nadu, India
| | - Mahendra Bhandari
- Vattikuti Urology Institute, Henry Ford Hospital, Detroit, Michigan, USA
| | - P Parameshwari
- Department of Community Medicine, Chengalpattu Medical College, Chengalpattu, Tamil Nadu, India
| | | | - Srikala Prasad
- Department of Urology, Chengalpattu Medical College, Chengalpattu, Tamil Nadu, India
| |
Collapse
|
36
|
Caglayan A, Slusarczyk W, Rabbani RD, Ghose A, Papadopoulos V, Boussios S. Large Language Models in Oncology: Revolution or Cause for Concern? Curr Oncol 2024; 31:1817-1830. [PMID: 38668040 PMCID: PMC11049602 DOI: 10.3390/curroncol31040137] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/29/2024] [Revised: 03/13/2024] [Accepted: 03/29/2024] [Indexed: 04/28/2024] Open
Abstract
The technological capability of artificial intelligence (AI) continues to advance rapidly. Recently, the release of large language models has taken the world by storm, with concurrent excitement and concern. As a consequence of their impressive ability and versatility, they present a potential opportunity for implementation in oncology. Areas of possible application include supporting clinical decision making, education, and cancer research. Despite the promise that these novel systems offer, several limitations and barriers challenge their implementation. It is imperative that concerns such as accountability, data inaccuracy, and data protection be addressed prior to their integration into oncology. As artificial intelligence systems continue to progress, new ethical and practical dilemmas will also emerge; thus, the evaluation of these limitations and concerns will be dynamic in nature. This review offers a comprehensive overview of the potential applications of large language models in oncology, as well as the concerns surrounding their implementation in cancer care.
Collapse
Affiliation(s)
- Aydin Caglayan
- Department of Medical Oncology, Medway NHS Foundation Trust, Gillingham ME7 5NY, UK; (A.C.); (R.D.R.); (A.G.)
| | | | - Rukhshana Dina Rabbani
- Department of Medical Oncology, Medway NHS Foundation Trust, Gillingham ME7 5NY, UK; (A.C.); (R.D.R.); (A.G.)
| | - Aruni Ghose
- Department of Medical Oncology, Medway NHS Foundation Trust, Gillingham ME7 5NY, UK; (A.C.); (R.D.R.); (A.G.)
- Department of Medical Oncology, Barts Cancer Centre, St Bartholomew’s Hospital, Barts Heath NHS Trust, London EC1A 7BE, UK
- Department of Medical Oncology, Mount Vernon Cancer Centre, East and North Hertfordshire Trust, London HA6 2RN, UK
- Health Systems and Treatment Optimisation Network, European Cancer Organisation, 1040 Brussels, Belgium
- Oncology Council, Royal Society of Medicine, London W1G 0AE, UK
| | | | - Stergios Boussios
- Department of Medical Oncology, Medway NHS Foundation Trust, Gillingham ME7 5NY, UK; (A.C.); (R.D.R.); (A.G.)
- Kent Medway Medical School, University of Kent, Canterbury CT2 7LX, UK;
- Faculty of Life Sciences & Medicine, School of Cancer & Pharmaceutical Sciences, King’s College London, Strand Campus, London WC2R 2LS, UK
- Faculty of Medicine, Health, and Social Care, Canterbury Christ Church University, Canterbury CT2 7PB, UK
- AELIA Organization, 9th Km Thessaloniki—Thermi, 57001 Thessaloniki, Greece
| |
Collapse
|
37
|
Deniz MS, Guler BY. Assessment of ChatGPT's adherence to ETA-thyroid nodule management guideline over two different time intervals 14 days apart: in binary and multiple-choice queries. Endocrine 2024:10.1007/s12020-024-03750-2. [PMID: 38489133 DOI: 10.1007/s12020-024-03750-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 01/06/2024] [Accepted: 02/15/2024] [Indexed: 03/17/2024]
Abstract
OBJECTIVE Artificial intelligence (AI) has significant potential in healthcare, particularly in providing decision support in specialized domains like thyroid nodule management. This study assesses the effectiveness of ChatGPT-v4, an advanced AI model, in aligning with the European Thyroid Association (ETA) 2023 guidelines. METHODS The study utilized a structured questionnaire comprising 100 questions, divided into true/false and multiple-choice formats, reflecting real-world clinical scenarios in thyroid nodule management. These questions encompassed diagnostic criteria, treatment options, follow-up protocols, and patient counseling. ChatGPT's responses were evaluated for accuracy, consistency, and comprehensiveness using a six-point Likert scale. The assessment was performed initially and repeated after 14 days. RESULTS In the binary queries, the AI model showed an ability to correct some initially incorrect responses, but there was also a noticeable regression in certain responses: 8 of the 11 initially non-compliant responses remained unchanged, 3 non-compliant responses were rectified, and 6 initially compliant answers transitioned to non-compliance after 14 days. In the multiple-choice queries, the AI's performance was more consistent: a majority of responses, 43 (86% of the total), were initially correct and remained correct upon re-assessment, while 4 initially incorrect responses remained unchanged and 3 correct responses shifted to non-compliance over time. CONCLUSION ChatGPT shows promise as a clinical support tool in thyroid nodule management, although its performance varied between binary and multiple-choice questions. CLINICAL TRIAL REGISTRATION N/A.
Collapse
Affiliation(s)
- Muzaffer Serdar Deniz
- Department of Endocrinology, Sincan Education and Research Hospital, Ankara, Turkey.
| | | |
Collapse
|
38
|
Cil G, Dogan K. The efficacy of artificial intelligence in urology: a detailed analysis of kidney stone-related queries. World J Urol 2024; 42:158. [PMID: 38483582 PMCID: PMC10940482 DOI: 10.1007/s00345-024-04847-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/26/2023] [Accepted: 01/24/2024] [Indexed: 03/17/2024] Open
Abstract
PURPOSE The study aimed to assess the efficacy of OpenAI's advanced AI model, ChatGPT, in diagnosing urological conditions, focusing on kidney stones. MATERIALS AND METHODS A set of 90 structured questions, compliant with the EAU Guidelines 2023, was curated by seasoned urologists for this investigation. We evaluated ChatGPT's performance based on the accuracy and completeness of its responses to two types of questions [binary (true/false) and descriptive (multiple-choice)], stratified into difficulty levels: easy, moderate, and complex. Furthermore, we analyzed the model's learning and adaptability capacity by reassessing the initially incorrect responses after a 2-week interval. RESULTS The model demonstrated commendable accuracy, correctly answering 80% of binary questions (n=45) and 93.3% of descriptive questions (n=45). The model's performance showed no significant variation across question difficulty levels (p=0.548 for accuracy; p=0.417 for completeness). Upon reassessment of the 12 initially incorrect responses (9 binary and 3 descriptive) after two weeks, ChatGPT's accuracy showed substantial improvement: the mean accuracy score significantly increased from 1.58 ± 0.51 to 2.83 ± 0.93 (p = 0.004), underlining the model's ability to learn and adapt over time. CONCLUSION These findings highlight the potential of ChatGPT in urological diagnostics, but also underscore areas requiring enhancement, especially in the completeness of responses to complex queries. The study endorses AI's incorporation into healthcare, while advocating for prudence and professional supervision in its application.
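The reported improvement (mean accuracy score rising from 1.58 to 2.83, p = 0.004) is a paired before/after comparison on the same 12 responses. The abstract does not name the test used, so the Wilcoxon signed-rank test below is an assumption, and the scores are hypothetical stand-ins:

```python
# Sketch of a paired before/after comparison, mirroring the two-week
# reassessment of the 12 initially incorrect responses. Scores are
# hypothetical; the choice of the Wilcoxon signed-rank test is an assumption.
from scipy.stats import wilcoxon

before = [1, 2, 1, 2, 2, 1, 2, 1, 2, 1, 2, 2]
after  = [3, 3, 2, 4, 3, 2, 3, 2, 4, 3, 3, 3]

stat, p = wilcoxon(before, after)
print(f"Wilcoxon W={stat}, p={p:.3f}")
```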
Collapse
Affiliation(s)
- Gökhan Cil
- Department of Urology, Bagcilar Training and Research Hospital, University of Health Sciences, Istanbul, Turkey.
| | - Kazim Dogan
- Department of Urology, Faculty of Medicine, Istinye University, Istanbul, Turkey
| |
Collapse
|
39
|
Bains SS, Dubin JA, Hameed D, Sax OC, Douglas S, Mont MA, Nace J, Delanois RE. Use and Application of Large Language Models for Patient Questions Following Total Knee Arthroplasty. J Arthroplasty 2024:S0883-5403(24)00233-X. [PMID: 38490569 DOI: 10.1016/j.arth.2024.03.017] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 12/15/2023] [Revised: 03/06/2024] [Accepted: 03/07/2024] [Indexed: 03/17/2024] Open
Abstract
BACKGROUND A consumer-focused health care model not only allows unprecedented access to information, but equally warrants consideration of the appropriateness of providing accurate patient health information. Nurses play a large role in influencing patient satisfaction following total knee arthroplasty (TKA), but they come at a cost. A specific natural language artificial intelligence (AI) model, ChatGPT (Chat Generative Pre-trained Transformer), has accumulated over 100 million users within months of launching. As such, we aimed to compare: (1) orthopaedic surgeons' evaluation of the appropriateness of the answers to the most frequently asked patient questions after TKA; and (2) patients' comfort level in having their postoperative questions answered using answers provided by arthroplasty-trained nurses and ChatGPT. METHODS We prospectively created 60 questions based on the most commonly asked patient questions following TKA. There were 3 fellowship-trained surgeons who assessed the answers provided by arthroplasty-trained nurses and ChatGPT-4 to each of the questions. The surgeons graded each set of responses based on clinical judgment as: (1) "appropriate"; (2) "inappropriate," if the response contained inappropriate information; or (3) "unreliable," if the responses provided inconsistent content. Patients' comfort level and trust in AI were assessed using Research Electronic Data Capture (REDCap) hosted at our local hospital. RESULTS The surgeons graded 44 out of 60 (73.3%) responses from the arthroplasty-trained nurses and 44 out of 60 (73.3%) from ChatGPT as "appropriate." Among the nurses' responses, 4 were graded "inappropriate" and 1 "unreliable"; among ChatGPT's responses, 5 were graded "inappropriate" and none "unreliable." There were 136 patients (53.8%) who were more comfortable with the answers provided by ChatGPT, compared to 86 patients (34.0%) who preferred the answers from arthroplasty-trained nurses. Of the 253 patients, 233 (92.1%) were uncertain whether they would trust AI to answer their postoperative questions. There were 127 patients (50.2%) who answered that if they knew an answer had been provided by ChatGPT, their comfort level in trusting it would change. CONCLUSIONS One potential use of ChatGPT can be found in providing appropriate answers to patient questions after TKA. At our institution, cost expenditures can potentially be minimized while maintaining patient satisfaction. Inevitably, successful implementation is dependent on the ability to provide information that is credible and in accordance with the objectives of both physicians and patients. LEVEL OF EVIDENCE III.
Collapse
Affiliation(s)
- Sandeep S Bains
- Rubin Institute for Advanced Orthopedics, LifeBridge Health, Sinai Hospital of Baltimore, Baltimore, Maryland
| | - Jeremy A Dubin
- Rubin Institute for Advanced Orthopedics, LifeBridge Health, Sinai Hospital of Baltimore, Baltimore, Maryland
| | - Daniel Hameed
- Rubin Institute for Advanced Orthopedics, LifeBridge Health, Sinai Hospital of Baltimore, Baltimore, Maryland
| | - Oliver C Sax
- Rubin Institute for Advanced Orthopedics, LifeBridge Health, Sinai Hospital of Baltimore, Baltimore, Maryland
| | - Scott Douglas
- Rubin Institute for Advanced Orthopedics, LifeBridge Health, Sinai Hospital of Baltimore, Baltimore, Maryland
| | - Michael A Mont
- Rubin Institute for Advanced Orthopedics, LifeBridge Health, Sinai Hospital of Baltimore, Baltimore, Maryland
| | - James Nace
- Rubin Institute for Advanced Orthopedics, LifeBridge Health, Sinai Hospital of Baltimore, Baltimore, Maryland
| | - Ronald E Delanois
- Rubin Institute for Advanced Orthopedics, LifeBridge Health, Sinai Hospital of Baltimore, Baltimore, Maryland
| |
Collapse
|
40
|
Park YJ, Pillai A, Deng J, Guo E, Gupta M, Paget M, Naugler C. Assessing the research landscape and clinical utility of large language models: a scoping review. BMC Med Inform Decis Mak 2024; 24:72. [PMID: 38475802 DOI: 10.1186/s12911-024-02459-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/21/2023] [Accepted: 02/12/2024] [Indexed: 03/14/2024] Open
Abstract
IMPORTANCE Large language models (LLMs) like OpenAI's ChatGPT are powerful generative systems that rapidly synthesize natural language responses. Research on LLMs has revealed their potential and pitfalls, especially in clinical settings. However, the evolving landscape of LLM research in medicine has left several gaps regarding their evaluation, application, and evidence base. OBJECTIVE This scoping review aims to (1) summarize current research evidence on the accuracy and efficacy of LLMs in medical applications, (2) discuss the ethical, legal, logistical, and socioeconomic implications of LLM use in clinical settings, (3) explore barriers and facilitators to LLM implementation in healthcare, (4) propose a standardized evaluation framework for assessing LLMs' clinical utility, and (5) identify evidence gaps and propose future research directions for LLMs in clinical applications. EVIDENCE REVIEW We screened 4,036 records from MEDLINE, EMBASE, CINAHL, medRxiv, bioRxiv, and arXiv from January 2023 (inception of the search) to June 26, 2023 for English-language papers and analyzed findings from 55 worldwide studies. Quality of evidence was reported based on the Oxford Centre for Evidence-based Medicine recommendations. FINDINGS Our results demonstrate that LLMs show promise in compiling patient notes, assisting patients in navigating the healthcare system, and to some extent, supporting clinical decision-making when combined with human oversight. However, their utilization is limited by biases in training data that may harm patients, the generation of inaccurate but convincing information, and ethical, legal, socioeconomic, and privacy concerns. We also identified a lack of standardized methods for evaluating LLMs' effectiveness and feasibility. CONCLUSIONS AND RELEVANCE This review thus highlights potential future directions and questions to address these limitations and to further explore LLMs' potential in enhancing healthcare delivery.
Collapse
Affiliation(s)
- Ye-Jean Park
- Temerty Faculty of Medicine, University of Toronto, 1 King's College Cir, M5S 1A8, Toronto, ON, Canada.
| | - Abhinav Pillai
- Cumming School of Medicine, University of Calgary, 3330 Hospital Dr NW, T2N 4N1, Calgary, AB, Canada
| | - Jiawen Deng
- Temerty Faculty of Medicine, University of Toronto, 1 King's College Cir, M5S 1A8, Toronto, ON, Canada
| | - Eddie Guo
- Cumming School of Medicine, University of Calgary, 3330 Hospital Dr NW, T2N 4N1, Calgary, AB, Canada
| | - Mehul Gupta
- Cumming School of Medicine, University of Calgary, 3330 Hospital Dr NW, T2N 4N1, Calgary, AB, Canada
| | - Mike Paget
- Cumming School of Medicine, University of Calgary, 3330 Hospital Dr NW, T2N 4N1, Calgary, AB, Canada
| | - Christopher Naugler
- Cumming School of Medicine, University of Calgary, 3330 Hospital Dr NW, T2N 4N1, Calgary, AB, Canada
| |
Collapse
|
41
|
Lee Y, Kim SY. Potential applications of ChatGPT in obstetrics and gynecology in Korea: a review article. Obstet Gynecol Sci 2024; 67:153-159. [PMID: 38247132 PMCID: PMC10948210 DOI: 10.5468/ogs.23231] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/21/2023] [Revised: 11/08/2023] [Accepted: 11/29/2023] [Indexed: 01/23/2024] Open
Abstract
The use of chatbot technology, particularly chat generative pre-trained transformer (ChatGPT) with an impressive 175 billion parameters, has garnered significant attention across various domains, including Obstetrics and Gynecology (OBGYN). This comprehensive review delves into the transformative potential of chatbots with a special focus on ChatGPT as a leading artificial intelligence (AI) technology. Moreover, ChatGPT harnesses the power of deep learning algorithms to generate responses that closely mimic human language, opening up myriad applications in medicine, research, and education. In the field of medicine, ChatGPT plays a pivotal role in diagnosis, treatment, and personalized patient education. Notably, the technology has demonstrated remarkable capabilities, surpassing human performance in OBGYN examinations, and delivering highly accurate diagnoses. However, challenges remain, including the need to verify the accuracy of the information and address the ethical considerations and limitations. In the wide scope of chatbot technology, AI systems play a vital role in healthcare processes, including documentation, diagnosis, research, and education. Although promising, the limitations and occasional inaccuracies require validation by healthcare professionals. This review also examined global chatbot adoption in healthcare, emphasizing the need for user awareness to ensure patient safety. Chatbot technology holds great promise in OBGYN and medicine, offering innovative solutions while necessitating responsible integration to ensure patient care and safety.
Collapse
Affiliation(s)
- YooKyung Lee
- Division of Maternal Fetal Medicine, Department of Obstetrics and Gynecology, MizMedi Hospital, Seoul, Korea
| | - So Yun Kim
- Division of Maternal Fetal Medicine, Department of Obstetrics and Gynecology, MizMedi Hospital, Seoul, Korea
| |
Collapse
|
42
|
Puladi B, Gsaxner C, Kleesiek J, Hölzle F, Röhrig R, Egger J. Response to the comment on "The impact and opportunities of large language models like ChatGPT in oral and maxillofacial surgery: a narrative review". Int J Oral Maxillofac Surg 2024:S0901-5027(24)00012-2. [PMID: 38310049 DOI: 10.1016/j.ijom.2023.12.010] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/07/2023] [Revised: 11/29/2023] [Accepted: 12/06/2023] [Indexed: 02/05/2024]
Affiliation(s)
- B Puladi
- Department of Oral and Maxillofacial Surgery, University Hospital RWTH Aachen, Aachen, Germany; Institute of Medical Informatics, University Hospital RWTH Aachen, Aachen, Germany
| | - C Gsaxner
- Department of Oral and Maxillofacial Surgery, University Hospital RWTH Aachen, Aachen, Germany; Institute of Medical Informatics, University Hospital RWTH Aachen, Aachen, Germany; Institute of Computer Graphics and Vision, Graz University of Technology, Graz, Austria; Department of Oral and Maxillofacial Surgery, Medical University of Graz, Graz, Austria
| | - J Kleesiek
- Institute for AI in Medicine (IKIM), University Hospital Essen (AöR), Essen, Germany
| | - F Hölzle
- Department of Oral and Maxillofacial Surgery, University Hospital RWTH Aachen, Aachen, Germany
| | - R Röhrig
- Institute of Medical Informatics, University Hospital RWTH Aachen, Aachen, Germany
| | - J Egger
- Institute of Computer Graphics and Vision, Graz University of Technology, Graz, Austria; Institute for AI in Medicine (IKIM), University Hospital Essen (AöR), Essen, Germany; Center for Virtual and Extended Reality in Medicine (ZvRM), Essen University Hospital (AöR), Essen, Germany.
| |
Collapse
|
43
|
Kavadella A, Dias da Silva MA, Kaklamanos EG, Stamatopoulos V, Giannakopoulos K. Evaluation of ChatGPT's Real-Life Implementation in Undergraduate Dental Education: Mixed Methods Study. JMIR Med Educ 2024; 10:e51344. [PMID: 38111256 PMCID: PMC10867750 DOI: 10.2196/51344] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 07/28/2023] [Revised: 10/28/2023] [Accepted: 12/11/2023] [Indexed: 12/20/2023]
Abstract
BACKGROUND The recent artificial intelligence tool ChatGPT seems to offer a range of benefits in academic education while also raising concerns. Relevant literature encompasses issues of plagiarism and academic dishonesty, as well as pedagogy and educational affordances; yet, no real-life implementation of ChatGPT in the educational process has been reported to our knowledge so far. OBJECTIVE This mixed methods study aimed to evaluate the implementation of ChatGPT in the educational process, both quantitatively and qualitatively. METHODS In March 2023, a total of 77 second-year dental students of the European University Cyprus were divided into 2 groups and asked to compose a learning assignment on "Radiation Biology and Radiation Protection in the Dental Office," working collaboratively in small subgroups, as part of the educational semester program of the Dentomaxillofacial Radiology module. Careful planning ensured a seamless integration of ChatGPT, addressing potential challenges. One group searched the internet for scientific resources to perform the task and the other group used ChatGPT for this purpose. Both groups developed a PowerPoint (Microsoft Corp) presentation based on their research and presented it in class. The ChatGPT group students additionally registered all interactions with the language model during the prompting process and evaluated the final outcome; they also answered an open-ended evaluation questionnaire, including questions on their learning experience. Finally, all students undertook a knowledge examination on the topic, and the grades between the 2 groups were compared statistically, whereas the free-text comments of the questionnaires were thematically analyzed. RESULTS Out of the 77 students, 39 were assigned to the ChatGPT group and 38 to the literature research group. Seventy students undertook the multiple choice question knowledge examination, and examination grades ranged from 5 to 10 on the 0-10 grading scale. The Mann-Whitney U test showed that students of the ChatGPT group performed significantly better (P=.045) than students of the literature research group. The evaluation questionnaires revealed the benefits (human-like interface, immediate response, and wide knowledge base), the limitations (need for rephrasing the prompts to get a relevant answer, general content, false citations, and incapability to provide images or videos), and the prospects (in education, clinical practice, continuing education, and research) of ChatGPT. CONCLUSIONS Students using ChatGPT for their learning assignments performed significantly better in the knowledge examination than their fellow students who used the literature research methodology. Students adapted quickly to the technological environment of the language model, recognized its opportunities and limitations, and used it creatively and efficiently. Implications for practice: the study underscores the adaptability of students to technological innovations including ChatGPT and its potential to enhance educational outcomes. Educators should consider integrating ChatGPT into curriculum design; awareness programs are warranted to educate both students and educators about the limitations of ChatGPT, encouraging critical engagement and responsible use.
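The grade comparison above is a standard two-sample Mann-Whitney U test, which the authors report directly. A minimal sketch with hypothetical grades on the study's 0-10 scale:

```python
# Minimal sketch of the Mann-Whitney U comparison of examination grades
# between the ChatGPT group and the literature-research group. Grades are
# hypothetical placeholders on the 0-10 scale, not the study's data.
from scipy.stats import mannwhitneyu

chatgpt_group    = [8, 9, 7, 10, 8, 9, 6, 8, 9, 7]
literature_group = [7, 6, 8, 7, 5, 7, 8, 6, 7, 6]

u_stat, p = mannwhitneyu(chatgpt_group, literature_group, alternative="two-sided")
print(f"U={u_stat}, p={p:.3f}")
```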
Collapse
Affiliation(s)
- Argyro Kavadella
- School of Dentistry, European University Cyprus, Nicosia, Cyprus
| | - Marco Antonio Dias da Silva
- Research Group of Teleducation and Teledentistry, Federal University of Campina Grande, Campina Grande, Brazil
| | - Eleftherios G Kaklamanos
- School of Dentistry, European University Cyprus, Nicosia, Cyprus
- School of Dentistry, Aristotle University of Thessaloniki, Thessaloniki, Greece
- Mohammed Bin Rashid University of Medicine and Health Sciences, Dubai, United Arab Emirates
| | - Vasileios Stamatopoulos
- Information Management Systems Institute, ATHENA Research and Innovation Center, Athens, Greece
| | | |
Collapse
|
44
|
Odabashian R, Bastin D, Jones G, Manzoor M, Tangestaniapour S, Assad M, Lakhani S, Odabashian M, McGee S. Assessment of ChatGPT-3.5's Knowledge in Oncology: Comparative Study with ASCO-SEP Benchmarks. JMIR AI 2024; 3:e50442. [PMID: 38875575 PMCID: PMC11041475 DOI: 10.2196/50442] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 07/02/2023] [Revised: 10/05/2023] [Accepted: 11/19/2023] [Indexed: 06/16/2024]
Abstract
BACKGROUND ChatGPT (OpenAI) is a state-of-the-art large language model that uses artificial intelligence (AI) to address questions across diverse topics. The American Society of Clinical Oncology Self-Evaluation Program (ASCO-SEP) created a comprehensive educational program to help physicians keep up to date with the many rapid advances in the field. The question bank consists of multiple-choice questions addressing the many facets of cancer care, including diagnosis, treatment, and supportive care. As ChatGPT applications rapidly expand, it becomes vital to ascertain whether the knowledge of ChatGPT-3.5 matches the established standards that oncologists are recommended to follow. OBJECTIVE This study aims to evaluate whether ChatGPT-3.5's knowledge aligns with the established benchmarks that oncologists are expected to adhere to. This will furnish us with a deeper understanding of the potential applications of this tool as a support for clinical decision-making. METHODS We conducted a systematic assessment of the performance of ChatGPT-3.5 on the ASCO-SEP, the leading educational and assessment tool for medical oncologists in training and practice. Over 1000 multiple-choice questions covering the spectrum of cancer care were extracted. Questions were categorized by cancer type or discipline, with subcategorization as treatment, diagnosis, or other. Answers were scored as correct if ChatGPT-3.5 selected the answer as defined by ASCO-SEP. RESULTS Overall, ChatGPT-3.5 achieved a score of 56.1% (583/1040) for the correct answers provided. The program demonstrated varying levels of accuracy across cancer types or disciplines. The highest accuracy was observed in questions related to developmental therapeutics (8/10; 80% correct), while the lowest accuracy was observed in questions related to gastrointestinal cancer (102/209; 48.8% correct). There was no significant difference in the program's performance across the predefined subcategories of diagnosis, treatment, and other (P=.16). CONCLUSIONS This study evaluated ChatGPT-3.5's oncology knowledge using the ASCO-SEP, aiming to address uncertainties regarding AI tools like ChatGPT in clinical decision-making. Our findings suggest that while ChatGPT-3.5 offers a hopeful outlook for AI in oncology, its present performance in ASCO-SEP tests necessitates further refinement to reach the requisite competency levels. Future assessments could explore ChatGPT's clinical decision support capabilities with real-world clinical scenarios, its ease of integration into medical workflows, and its potential to foster interdisciplinary collaboration and patient engagement in health care settings.
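Comparing accuracy across the diagnosis/treatment/other subcategories, as in the P=.16 result above, amounts to a contingency-table test. The abstract does not name the test, so the chi-square test of independence below is an assumption, and the counts are invented for illustration:

```python
# Sketch: per-category accuracy and a chi-square test of independence across
# the diagnosis / treatment / other subcategories. Counts are illustrative,
# not the study's data, and the choice of test is an assumption.
from scipy.stats import chi2_contingency

#            diagnosis  treatment  other
correct   = [      120,       310,   153]
incorrect = [       95,       255,   107]

chi2, p, dof, _ = chi2_contingency([correct, incorrect])
print(f"chi2={chi2:.2f}, dof={dof}, p={p:.3f}")

for name, c, i in zip(["diagnosis", "treatment", "other"], correct, incorrect):
    print(f"{name}: {c / (c + i):.1%} correct")
```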
Collapse
Affiliation(s)
- Roupen Odabashian
- Department of Oncology, Barbara Ann Karmanos Cancer Institute, Wayne State University, Detroit, MI, United States
| | - Donald Bastin
- Department of Medicine, Division of Internal Medicine, The Ottawa Hospital and the University of Ottawa, Ottawa, ON, Canada
| | - Georden Jones
- Mary A Rackham Institute, University of Michigan, Ann Arbor, MI, United States
| | | | | | - Malke Assad
- Department of Plastic Surgery, University of Pittsburgh Medical Center, Pittsburgh, PA, United States
| | - Sunita Lakhani
- Department of Medicine, Division of Internal Medicine, Jefferson Abington Hospital, Philadelphia, PA, United States
| | - Maritsa Odabashian
- Mary A Rackham Institute, University of Michigan, Ann Arbor, MI, United States
- The Ottawa Hospital Research Institute, Ottawa, ON, Canada
| | - Sharon McGee
- Department of Medicine, Division of Medical Oncology, The Ottawa Hospital and the University of Ottawa, Ottawa, ON, Canada
- Cancer Therapeutics Program, Ottawa Hospital Research Institute, Ottawa, ON, Canada
| |
Collapse
|
45
|
Zheng Y, Sun X, Feng B, Kang K, Yang Y, Zhao A, Wu Y. Rare and complex diseases in focus: ChatGPT's role in improving diagnosis and treatment. Front Artif Intell 2024; 7:1338433. [PMID: 38283995 PMCID: PMC10808657 DOI: 10.3389/frai.2024.1338433] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/14/2023] [Accepted: 01/02/2024] [Indexed: 01/30/2024] Open
Abstract
Rare and complex diseases pose significant challenges to both patients and healthcare providers. These conditions often present with atypical symptoms, making diagnosis and treatment a formidable task. In recent years, artificial intelligence and natural language processing technologies have shown great promise in assisting medical professionals in diagnosing and managing such conditions. This paper explores the role of ChatGPT, an advanced artificial intelligence model, in improving the diagnosis and treatment of rare and complex diseases. By analyzing its potential applications, limitations, and ethical considerations, we demonstrate how ChatGPT can contribute to better patient outcomes and enhance the healthcare system's overall effectiveness.
Collapse
Affiliation(s)
- Yue Zheng
- Cancer Center, West China Hospital, Sichuan University, Chengdu, Sichuan, China
| | - Xu Sun
- Department of Hematology, West China Hospital, Sichuan University, Chengdu, Sichuan, China
| | - Baijie Feng
- West China School of Medicine, Sichuan University, Chengdu, Sichuan, China
| | - Kai Kang
- Cancer Center, West China Hospital, Sichuan University, Chengdu, Sichuan, China
| | - Yuqi Yang
- West China School of Medicine, Sichuan University, Chengdu, Sichuan, China
| | - Ailin Zhao
- Department of Hematology, West China Hospital, Sichuan University, Chengdu, Sichuan, China
| | - Yijun Wu
- Cancer Center, West China Hospital, Sichuan University, Chengdu, Sichuan, China
| |
Collapse
|
46
|
Padovan M, Cosci B, Petillo A, Nerli G, Porciatti F, Scarinci S, Carlucci F, Dell’Amico L, Meliani N, Necciari G, Lucisano VC, Marino R, Foddis R, Palla A. ChatGPT in Occupational Medicine: A Comparative Study with Human Experts. Bioengineering (Basel) 2024; 11:57. [PMID: 38247934 PMCID: PMC10813435 DOI: 10.3390/bioengineering11010057] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/07/2023] [Revised: 01/01/2024] [Accepted: 01/04/2024] [Indexed: 01/23/2024] Open
Abstract
The objective of this study is to evaluate ChatGPT's accuracy and reliability in answering complex medical questions related to occupational health, and to explore the implications and limitations of AI in occupational health medicine. The study also provides recommendations for future research in this area and informs decision-makers about AI's impact on healthcare. A group of physicians was enlisted to create a dataset of questions and answers on Italian occupational medicine legislation. The physicians were divided into two teams, and each team member was assigned a different subject area. ChatGPT was used to generate answers for each question, with and without legislative context. The two teams then blindly evaluated the human- and AI-generated answers, each group reviewing the other group's work. Occupational physicians outperformed ChatGPT in generating accurate answers on a 5-point Likert scale, while the answers provided by ChatGPT with access to legislative texts were comparable to those of professional doctors. Still, we found that users tend to prefer answers generated by humans, indicating that while ChatGPT is useful, users still value the opinions of occupational medicine professionals.
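The study's two prompting conditions, with and without the legislative text, map naturally onto a system-message context in a chat-completion call. A minimal sketch using the OpenAI Python SDK (v1.x) follows; the model name, prompt wording, and helper function are illustrative assumptions, not the study's actual setup:

```python
# Sketch of the two prompting conditions: the same question asked with and
# without legislative context. The model name, prompts, and helper function
# are illustrative assumptions, not the study's actual configuration.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(question: str, context: str | None = None) -> str:
    messages = []
    if context is not None:
        messages.append({"role": "system",
                         "content": f"Answer using only this legislative text:\n{context}"})
    messages.append({"role": "user", "content": question})
    resp = client.chat.completions.create(model="gpt-4", messages=messages)
    return resp.choices[0].message.content

question = "What health surveillance must an employer arrange?"  # hypothetical
answer_without_context = ask(question)
answer_with_context = ask(question, context="<relevant statute text here>")
```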
Collapse
Affiliation(s)
- Martina Padovan
- Department of Translational Research and New Technologies in Medicine and Surgery, University of Pisa, 56126 Pisa, Italy; (M.P.); (B.C.); (A.P.); (G.N.); (F.P.); (S.S.); (F.C.); (L.D.); (N.M.); (G.N.); (R.M.)
| | - Bianca Cosci
- Department of Translational Research and New Technologies in Medicine and Surgery, University of Pisa, 56126 Pisa, Italy; (M.P.); (B.C.); (A.P.); (G.N.); (F.P.); (S.S.); (F.C.); (L.D.); (N.M.); (G.N.); (R.M.)
| | - Armando Petillo
- Department of Translational Research and New Technologies in Medicine and Surgery, University of Pisa, 56126 Pisa, Italy; (M.P.); (B.C.); (A.P.); (G.N.); (F.P.); (S.S.); (F.C.); (L.D.); (N.M.); (G.N.); (R.M.)
| | - Gianluca Nerli
- Department of Translational Research and New Technologies in Medicine and Surgery, University of Pisa, 56126 Pisa, Italy; (M.P.); (B.C.); (A.P.); (G.N.); (F.P.); (S.S.); (F.C.); (L.D.); (N.M.); (G.N.); (R.M.)
| | - Francesco Porciatti
- Department of Translational Research and New Technologies in Medicine and Surgery, University of Pisa, 56126 Pisa, Italy; (M.P.); (B.C.); (A.P.); (G.N.); (F.P.); (S.S.); (F.C.); (L.D.); (N.M.); (G.N.); (R.M.)
| | - Sergio Scarinci
- Department of Translational Research and New Technologies in Medicine and Surgery, University of Pisa, 56126 Pisa, Italy; (M.P.); (B.C.); (A.P.); (G.N.); (F.P.); (S.S.); (F.C.); (L.D.); (N.M.); (G.N.); (R.M.)
| | - Francesco Carlucci
- Department of Translational Research and New Technologies in Medicine and Surgery, University of Pisa, 56126 Pisa, Italy; (M.P.); (B.C.); (A.P.); (G.N.); (F.P.); (S.S.); (F.C.); (L.D.); (N.M.); (G.N.); (R.M.)
| | - Letizia Dell’Amico
- Department of Translational Research and New Technologies in Medicine and Surgery, University of Pisa, 56126 Pisa, Italy; (M.P.); (B.C.); (A.P.); (G.N.); (F.P.); (S.S.); (F.C.); (L.D.); (N.M.); (G.N.); (R.M.)
| | - Niccolò Meliani
- Department of Translational Research and New Technologies in Medicine and Surgery, University of Pisa, 56126 Pisa, Italy; (M.P.); (B.C.); (A.P.); (G.N.); (F.P.); (S.S.); (F.C.); (L.D.); (N.M.); (G.N.); (R.M.)
| | - Gabriele Necciari
- Department of Translational Research and New Technologies in Medicine and Surgery, University of Pisa, 56126 Pisa, Italy; (M.P.); (B.C.); (A.P.); (G.N.); (F.P.); (S.S.); (F.C.); (L.D.); (N.M.); (G.N.); (R.M.)
| | - Vincenzo Carmelo Lucisano
- Department of Translational Research and New Technologies in Medicine and Surgery, University of Pisa, 56126 Pisa, Italy; (M.P.); (B.C.); (A.P.); (G.N.); (F.P.); (S.S.); (F.C.); (L.D.); (N.M.); (G.N.); (R.M.)
| | - Riccardo Marino
- Department of Translational Research and New Technologies in Medicine and Surgery, University of Pisa, 56126 Pisa, Italy; (M.P.); (B.C.); (A.P.); (G.N.); (F.P.); (S.S.); (F.C.); (L.D.); (N.M.); (G.N.); (R.M.)
| | - Rudy Foddis
- Department of Translational Research and New Technologies in Medicine and Surgery, University of Pisa, 56126 Pisa, Italy; (M.P.); (B.C.); (A.P.); (G.N.); (F.P.); (S.S.); (F.C.); (L.D.); (N.M.); (G.N.); (R.M.)
| | | |
Collapse
|
47
|
Younis HA, Eisa TAE, Nasser M, Sahib TM, Noor AA, Alyasiri OM, Salisu S, Hayder IM, Younis HA. A Systematic Review and Meta-Analysis of Artificial Intelligence Tools in Medicine and Healthcare: Applications, Considerations, Limitations, Motivation and Challenges. Diagnostics (Basel) 2024; 14:109. [PMID: 38201418 PMCID: PMC10802884 DOI: 10.3390/diagnostics14010109] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/29/2023] [Revised: 12/02/2023] [Accepted: 12/04/2023] [Indexed: 01/12/2024] Open
Abstract
Artificial intelligence (AI) has emerged as a transformative force in various sectors, including medicine and healthcare. Large language models like ChatGPT showcase AI's potential by generating human-like text through prompts. ChatGPT's adaptability holds promise for reshaping medical practices, improving patient care, and enhancing interactions among healthcare professionals, patients, and data. In pandemic management, ChatGPT rapidly disseminates vital information; it also serves as a virtual assistant in surgical consultations, supports dental practice, simplifies medical education, and aids in disease diagnosis. A systematic literature review using the PRISMA approach explored AI's transformative potential in healthcare, highlighting ChatGPT's versatile applications, limitations, motivations, and challenges. A total of 82 papers were categorised into eight major areas: G1, treatment and medicine; G2, buildings and equipment; G3, parts of the human body and areas of disease; G4, patients; G5, citizens; G6, cellular imaging, radiology, pulse and medical images; G7, doctors and nurses; and G8, tools, devices and administration. Balancing AI's role with human judgment remains a challenge. In conclusion, ChatGPT's diverse medical applications demonstrate its potential for innovation, serving as a valuable resource and practical guide for students, academics, and researchers in medicine and healthcare.
Collapse
Affiliation(s)
- Hussain A. Younis
- College of Education for Women, University of Basrah, Basrah 61004, Iraq
| | | | - Maged Nasser
- Computer & Information Sciences Department, Universiti Teknologi PETRONAS, Seri Iskandar 32610, Malaysia;
| | - Thaeer Mueen Sahib
- Kufa Technical Institute, Al-Furat Al-Awsat Technical University, Kufa 54001, Iraq;
| | - Ameen A. Noor
- Computer Science Department, College of Education, University of Almustansirya, Baghdad 10045, Iraq;
| | | | - Sani Salisu
- Department of Information Technology, Federal University Dutse, Dutse 720101, Nigeria;
| | - Israa M. Hayder
- Qurna Technique Institute, Southern Technical University, Basrah 61016, Iraq;
| | - Hameed AbdulKareem Younis
- Department of Cybersecurity, College of Computer Science and Information Technology, University of Basrah, Basrah 61016, Iraq;
| |
Collapse
|
48
|
Morales-Ramirez P, Mishek H, Dasgupta A. The Genie Is Out of the Bottle: What ChatGPT Can and Cannot Do for Medical Professionals. Obstet Gynecol 2024; 143:e1-e6. [PMID: 37944140 DOI: 10.1097/aog.0000000000005446] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/24/2023] [Accepted: 10/12/2023] [Indexed: 11/12/2023]
Abstract
ChatGPT is a cutting-edge artificial intelligence technology that was released for public use in November 2022. Its rapid adoption has raised questions about capabilities, limitations, and risks. This article presents an overview of ChatGPT, and it highlights the current state of this technology for the medical field. The article seeks to provide a balanced perspective on what the model can and cannot do in three specific domains: clinical practice, research, and medical education. It also provides suggestions on how to optimize the use of this tool.
Collapse
|
49
|
Mannstadt I, Mehta B. Large language models and the future of rheumatology: assessing impact and emerging opportunities. Curr Opin Rheumatol 2024; 36:46-51. [PMID: 37729050 DOI: 10.1097/bor.0000000000000981] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 09/22/2023]
Abstract
PURPOSE OF REVIEW Large language models (LLMs) have grown rapidly in size and capabilities as more training data and compute power have become available. Since the release of ChatGPT in late 2022, there has been growing interest and exploration around potential applications of LLM technology. Numerous examples and pilot studies demonstrating the capabilities of these tools have emerged across several domains. For rheumatology professionals and patients, LLMs have the potential to transform current practices in medicine. RECENT FINDINGS Recent studies have begun exploring capabilities of LLMs that can assist rheumatologists in clinical practice, research, and medical education, though applications are still emerging. In clinical settings, LLMs have shown promise in assisting healthcare professionals, enabling more personalized medicine and generating routine documentation like notes and letters. Challenges remain around integrating LLMs into clinical workflows, ensuring the accuracy of their outputs, and preserving patient data confidentiality. In research, early experiments demonstrate LLMs can offer analysis of datasets, with quality control as a critical piece. Lastly, LLMs could supplement medical education by providing personalized learning experiences and integrating into established curricula. SUMMARY As these powerful tools continue evolving at a rapid pace, rheumatology professionals should stay informed on how they may impact the field.
Collapse
Affiliation(s)
| | - Bella Mehta
- Weill Cornell Medicine
- Hospital for Special Surgery, New York, New York, USA
| |
Collapse
|
50
|
Giannakopoulos K, Kavadella A, Aaqel Salim A, Stamatopoulos V, Kaklamanos EG. Evaluation of the Performance of Generative AI Large Language Models ChatGPT, Google Bard, and Microsoft Bing Chat in Supporting Evidence-Based Dentistry: Comparative Mixed Methods Study. J Med Internet Res 2023; 25:e51580. [PMID: 38009003 PMCID: PMC10784979 DOI: 10.2196/51580] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/04/2023] [Revised: 10/15/2023] [Accepted: 11/20/2023] [Indexed: 11/28/2023] Open
Abstract
BACKGROUND The increasing application of generative artificial intelligence large language models (LLMs) in various fields, including dentistry, raises questions about their accuracy. OBJECTIVE This study aims to comparatively evaluate the answers provided by 4 LLMs, namely Bard (Google LLC), ChatGPT-3.5 and ChatGPT-4 (OpenAI), and Bing Chat (Microsoft Corp), to clinically relevant questions from the field of dentistry. METHODS The LLMs were queried with 20 open-type, clinical dentistry-related questions from different disciplines, developed by the respective faculty of the School of Dentistry, European University Cyprus. The LLMs' answers were graded 0 (minimum) to 10 (maximum) points against strong, traditionally collected scientific evidence, such as guidelines and consensus statements, using a rubric, as if they were examination questions posed to students, by 2 experienced faculty members. The scores were statistically compared to identify the best-performing model using the Friedman and Wilcoxon tests. Moreover, the evaluators were asked to provide a qualitative evaluation of the comprehensiveness, scientific accuracy, clarity, and relevance of the LLMs' answers. RESULTS Overall, no statistically significant difference was detected between the scores given by the 2 evaluators; therefore, an average score was computed for every LLM. Although ChatGPT-4 statistically outperformed ChatGPT-3.5 (P=.008), Bing Chat (P=.049), and Bard (P=.045), all models occasionally exhibited inaccuracies, generality, outdated content, and a lack of source references. The evaluators noted instances where the LLMs delivered irrelevant information, vague answers, or information that was not fully accurate. CONCLUSIONS This study demonstrates that although LLMs hold promising potential as an aid in the implementation of evidence-based dentistry, their current limitations can lead to potentially harmful health care decisions if not used judiciously. Therefore, these tools should not replace the dentist's critical thinking and in-depth understanding of the subject matter. Further research, clinical validation, and model improvements are necessary for these tools to be fully integrated into dental practice. Dental practitioners must be aware of the limitations of LLMs, as their imprudent use could potentially impact patient care. Regulatory measures should be established to oversee the use of these evolving technologies.
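The scoring comparison above, the same questions graded for all four models, is a repeated-measures design, which is why the authors use a Friedman test followed by pairwise Wilcoxon tests. A minimal sketch with hypothetical rubric scores:

```python
# Sketch: Friedman test across four models' rubric scores on the same
# questions, followed by one pairwise Wilcoxon comparison. Scores are
# hypothetical placeholders on the study's 0-10 rubric, not its data.
from scipy.stats import friedmanchisquare, wilcoxon

bard      = [6, 5, 7, 6, 5, 6, 7, 5, 6, 6]
chatgpt35 = [7, 6, 7, 7, 6, 6, 8, 6, 7, 7]
chatgpt4  = [8, 8, 9, 8, 7, 8, 9, 7, 8, 8]
bing_chat = [7, 6, 8, 7, 6, 7, 7, 6, 7, 7]

stat, p = friedmanchisquare(bard, chatgpt35, chatgpt4, bing_chat)
print(f"Friedman chi2={stat:.2f}, p={p:.4f}")

# One pairwise follow-up; in practice all pairs, with multiplicity correction.
w, pw = wilcoxon(chatgpt4, chatgpt35)
print(f"ChatGPT-4 vs ChatGPT-3.5: W={w}, p={pw:.4f}")
```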
Collapse
Affiliation(s)
| | - Argyro Kavadella
- School of Dentistry, European University Cyprus, Nicosia, Cyprus
| | - Anas Aaqel Salim
- School of Dentistry, European University Cyprus, Nicosia, Cyprus
| | - Vassilis Stamatopoulos
- Information Management Systems Institute, ATHENA Research and Innovation Center, Athens, Greece
| | - Eleftherios G Kaklamanos
- School of Dentistry, European University Cyprus, Nicosia, Cyprus
- School of Dentistry, Aristotle University of Thessaloniki, Thessaloniki, Greece
- Mohammed Bin Rashid University of Medicine and Health Sciences, Dubai, United Arab Emirates
| |
Collapse
|