1
Suárez A, Jiménez J, Llorente de Pedro M, Andreu-Vázquez C, Díaz-Flores García V, Gómez Sánchez M, Freire Y. Beyond the Scalpel: Assessing ChatGPT's potential as an auxiliary intelligent virtual assistant in oral surgery. Comput Struct Biotechnol J 2024; 24:46-52. PMID: 38162955; PMCID: PMC10755495; DOI: 10.1016/j.csbj.2023.11.058.
Abstract
AI has revolutionized the way we interact with technology. Noteworthy advances in AI algorithms and large language models (LLMs) have led to the development of natural generative language (NGL) systems such as ChatGPT. Although these LLMs can simulate human conversations and generate content in real time, they face challenges related to the topicality and accuracy of the information they generate. This study aimed to assess whether ChatGPT-4 could provide accurate and reliable answers to general dentists in the field of oral surgery, and thus to explore its potential as an intelligent virtual assistant in clinical decision making in oral surgery. Thirty questions related to oral surgery were posed to ChatGPT-4, each question repeated 30 times, yielding a total of 900 responses. Two surgeons graded the answers according to the guidelines of the Spanish Society of Oral Surgery, using a three-point Likert scale (correct, partially correct/incomplete, and incorrect). Disagreements were arbitrated by an experienced oral surgeon, who provided the final grade. Accuracy was found to be 71.7%, and the consistency of the experts' grading across iterations ranged from moderate to almost perfect. ChatGPT-4, with its potential capabilities, will inevitably be integrated into dental disciplines, including oral surgery. In the future, it could be considered an auxiliary intelligent virtual assistant, though it would never replace oral surgery experts. Proper training and expert-verified information will remain vital to the implementation of the technology. More comprehensive research is needed to ensure the safe and successful application of AI in oral surgery.
Affiliation(s)
- Ana Suárez
- Department of Pre-Clinic Dentistry, Faculty of Biomedical and Health Sciences, Universidad Europea de Madrid, Calle Tajo s/n, Villaviciosa de Odón, 28670 Madrid, Spain
- Jaime Jiménez
- Department of Clinic Dentistry, Faculty of Biomedical and Health Sciences, Universidad Europea de Madrid, Calle Tajo s/n, Villaviciosa de Odón, 28670 Madrid, Spain
- María Llorente de Pedro
- Department of Pre-Clinic Dentistry, Faculty of Biomedical and Health Sciences, Universidad Europea de Madrid, Calle Tajo s/n, Villaviciosa de Odón, 28670 Madrid, Spain
- Cristina Andreu-Vázquez
- Department of Veterinary Medicine, Faculty of Biomedical and Health Sciences, Universidad Europea de Madrid, Calle Tajo s/n, Villaviciosa de Odón, 28670 Madrid, Spain
- Víctor Díaz-Flores García
- Department of Pre-Clinic Dentistry, Faculty of Biomedical and Health Sciences, Universidad Europea de Madrid, Calle Tajo s/n, Villaviciosa de Odón, 28670 Madrid, Spain
- Margarita Gómez Sánchez
- Department of Pre-Clinic Dentistry, Faculty of Biomedical and Health Sciences, Universidad Europea de Madrid, Calle Tajo s/n, Villaviciosa de Odón, 28670 Madrid, Spain
- Yolanda Freire
- Department of Pre-Clinic Dentistry, Faculty of Biomedical and Health Sciences, Universidad Europea de Madrid, Calle Tajo s/n, Villaviciosa de Odón, 28670 Madrid, Spain
2
Turan Eİ, Baydemir AE, Özcan FG, Şahin AS. Evaluating the accuracy of ChatGPT-4 in predicting ASA scores: A prospective multicentric study. J Clin Anesth 2024; 96:111475. PMID: 38657530; DOI: 10.1016/j.jclinane.2024.111475.
Abstract
BACKGROUND This study investigates the potential of ChatGPT-4, developed by OpenAI, in enhancing medical decision-making processes, particularly in preoperative assessments using the American Society of Anesthesiologists (ASA) scoring system. The ASA score, a critical tool in evaluating patients' health status and anesthesia risks before surgery, categorizes patients from I to VI based on their overall health and risk factors. Despite its widespread use, determining accurate ASA scores remains a subjective process that may benefit from AI-supported assessments. This research aims to evaluate ChatGPT-4's capability to predict ASA scores accurately compared to expert anesthesiologists' assessments. METHODS In this prospective multicentric study, ethical board approval was obtained, and the study was registered with clinicaltrials.gov (NCT06321445). We included 2851 patients from anesthesiology outpatient clinics, spanning neonates to all age groups and genders, with ASA scores between I-IV. Exclusion criteria were set for ASA V and VI scores, emergency operations, and insufficient information for ASA score determination. Data on patients' demographics, health conditions, and ASA scores by anesthesiologists were collected and anonymized. ChatGPT-4 was then tasked with assigning ASA scores based on the standardized patient data. RESULTS Our results indicate a high level of concordance between ChatGPT-4 predictions and anesthesiologists' evaluations, with Cohen's kappa analysis showing a kappa value of 0.858 (p = 0.000). While the model demonstrated over 90% accuracy in predicting ASA scores I to III, it showed a notable variance in ASA IV scores, suggesting a potential limitation in assessing patients with more complex health conditions. DISCUSSION The findings suggest that ChatGPT-4 can significantly contribute to the medical field by supporting anesthesiologists in preoperative assessments. This study not only demonstrates ChatGPT-4's efficacy in medical data analysis and decision-making but also opens new avenues for AI applications in healthcare, particularly in enhancing patient safety and optimizing surgical outcomes. Further research is needed to refine AI models for complex case assessments and integrate them seamlessly into clinical workflows.
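The concordance analysis above rests on Cohen's kappa between paired categorical ratings. A minimal sketch of that computation, assuming two aligned lists of ASA scores; the data below are invented, not the study's:

```python
# Agreement between anesthesiologists' ASA scores and model-predicted scores,
# measured with Cohen's kappa as in the abstract above.
from sklearn.metrics import cohen_kappa_score

# Invented ASA scores (1-4) for ten patients; the study had 2,851.
asa_by_anesthesiologist = [1, 2, 2, 3, 1, 4, 2, 3, 1, 2]
asa_by_chatgpt          = [1, 2, 2, 3, 1, 3, 2, 3, 1, 2]

kappa = cohen_kappa_score(asa_by_anesthesiologist, asa_by_chatgpt)
print(f"Cohen's kappa: {kappa:.3f}")  # values near 0.86 indicate strong agreement
```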
Affiliation(s)
- Engin İhsan Turan
- Department of Anesthesiology, Istanbul Health Science University Kanuni Sultan Süleyman Education and Training Hospital, Istanbul, Turkey.
- Funda Gümüş Özcan
- Department of Anesthesiology, Basaksehir Cam ve Sakura City Hospital, Istanbul, Turkey
- Ayça Sultan Şahin
- Department of Anesthesiology, Istanbul Health Science University Kanuni Sultan Süleyman Education and Training Hospital, Istanbul, Turkey
3
Karacan E. Evaluating the Quality of Postpartum Hemorrhage Nursing Care Plans Generated by Artificial Intelligence Models. J Nurs Care Qual 2024; 39:206-211. PMID: 38701406; DOI: 10.1097/ncq.0000000000000766.
Abstract
BACKGROUND With the rapidly advancing technological landscape of health care, evaluating the potential use of artificial intelligence (AI) models to prepare nursing care plans is of great importance. PURPOSE The purpose of this study was to evaluate the quality of nursing care plans created by AI for the management of postpartum hemorrhage (PPH). METHODS This cross-sectional exploratory study involved creating a scenario for an imaginary patient with PPH. Information was entered into 3 AI platforms (GPT-4, LaMDA, Med-PaLM) on consecutive days without prior conversation. Care plans were evaluated using the Grading of Recommendations Assessment, Development, and Evaluation (GRADE) scale. RESULTS Med-PaLM exhibited superior quality in developing the care plan compared with LaMDA (Z = 4.354; P = .000) and GPT-4 (Z = 3.126; P = .029). CONCLUSIONS Our findings suggest that despite the strong performance of Med-PaLM, AI, in its current state, is unsuitable for use with real patients.
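The abstract reports pairwise Z statistics without naming the underlying test; a rank-based comparison of ordinal quality ratings, such as a Mann-Whitney U test, is one plausible reading. A sketch under that assumption, with invented ratings:

```python
# Rank-based comparison of per-item care-plan quality ratings between two
# AI platforms. Ratings are invented for illustration only.
from scipy.stats import mannwhitneyu

medpalm_scores = [4, 5, 4, 4, 5, 3, 4, 5]
gpt4_scores    = [3, 4, 3, 2, 4, 3, 3, 4]

u_stat, p_value = mannwhitneyu(medpalm_scores, gpt4_scores, alternative="two-sided")
print(f"U = {u_stat}, p = {p_value:.3f}")
```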
Affiliation(s)
- Emine Karacan
- Dortyol Vocational School of Health Services, Iskenderun Technical University, Hatay, Turkey
4
Levin C, Kagan T, Rosen S, Saban M. An evaluation of the capabilities of language models and nurses in providing neonatal clinical decision support. Int J Nurs Stud 2024; 155:104771. PMID: 38688103; DOI: 10.1016/j.ijnurstu.2024.104771.
Abstract
AIM To assess the clinical reasoning capabilities of two large language models, ChatGPT-4 and Claude-2.0, compared to those of neonatal nurses during neonatal care scenarios. DESIGN A cross-sectional study with a comparative evaluation using a survey instrument that included six neonatal intensive care unit clinical scenarios. PARTICIPANTS 32 neonatal intensive care nurses with 5-10 years of experience working in the neonatal intensive care units of three medical centers. METHODS Participants responded to 6 written clinical scenarios. Simultaneously, we asked ChatGPT-4 and Claude-2.0 to provide initial assessments and treatment recommendations for the same scenarios. The responses from ChatGPT-4 and Claude-2.0 were then scored by certified neonatal nurse practitioners for accuracy, completeness, and response time. RESULTS Both models demonstrated capabilities in clinical reasoning for neonatal care, with Claude-2.0 significantly outperforming ChatGPT-4 in clinical accuracy and speed. However, limitations were identified across the cases in diagnostic precision, treatment specificity, and response lag. CONCLUSIONS While showing promise, current limitations reinforce the need for deep refinement before ChatGPT-4 and Claude-2.0 can be considered for integration into clinical practice. Additional validation of these tools is important to safely leverage this Artificial Intelligence technology for enhancing clinical decision-making. IMPACT The study provides an understanding of the reasoning accuracy of new Artificial Intelligence models in neonatal clinical care. The current accuracy gaps of ChatGPT-4 and Claude-2.0 need to be addressed prior to clinical usage.
Affiliation(s)
- Chedva Levin
- Faculty of School of Life and Health Sciences, Nursing Department, The Jerusalem College of Technology-Lev Academic Center, Jerusalem, Israel; The Department of Vascular Surgery, The Chaim Sheba Medical Center, Tel Hashomer, Ramat Gan, Tel Aviv, Israel
- Shani Rosen
- Department of Nursing, School of Health Professions, Faculty of Medical and Health Sciences, Tel Aviv University, Israel
- Mor Saban
- Department of Nursing, School of Health Professions, Faculty of Medical and Health Sciences, Tel Aviv University, Israel
5
Alkhalaf M, Yu P, Yin M, Deng C. Applying generative AI with retrieval augmented generation to summarize and extract key clinical information from electronic health records. J Biomed Inform 2024:104662. PMID: 38880236; DOI: 10.1016/j.jbi.2024.104662.
Abstract
BACKGROUND Malnutrition is a prevalent issue in residential aged care facilities (RACFs), leading to adverse health outcomes. The ability to efficiently extract key clinical information from the large volume of data in electronic health records (EHRs) can improve understanding of the extent of the problem and support the development of effective interventions. This research aimed to test the efficacy of zero-shot prompt engineering applied to generative artificial intelligence (AI) models, on their own and in combination with retrieval augmented generation (RAG), for automating the tasks of summarizing structured and unstructured EHR data and extracting important malnutrition information. METHODOLOGY We utilized the Llama 2 13B model with zero-shot prompting. The dataset comprises unstructured and structured EHRs related to malnutrition management in 40 Australian RACFs. We applied zero-shot learning to the model alone first, then combined it with RAG, to accomplish two tasks: generating structured summaries of a client's nutritional status and extracting key information about malnutrition risk factors. We utilized 25 notes in the first task and 1,399 in the second. We manually evaluated the model's output on each task against a gold standard dataset. RESULTS The evaluation outcomes indicated that zero-shot learning applied to a generative AI model is highly effective in summarizing and extracting information about the nutritional status of RACF clients. The generated summaries provided a concise and accurate representation of the original data, with an overall accuracy of 93.25%. The addition of RAG improved the summarization process, leading to a 6% increase and an accuracy of 99.25%. The model also proved capable of extracting risk factors, with an accuracy of 90%; however, adding RAG did not further improve accuracy on this task. Overall, the model showed robust performance when information was explicitly stated in the notes, but it could encounter hallucination limitations, particularly when details were not explicitly provided. CONCLUSION This study demonstrates the high performance, and the limitations, of applying zero-shot learning to generative AI models for automatic structured summarization of EHR data and extraction of key clinical information. The inclusion of the RAG approach improved model performance and mitigated the hallucination problem.
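The RAG pattern described above (retrieve relevant note passages, then prepend them to a zero-shot prompt) can be sketched as follows. TF-IDF retrieval stands in for whatever retriever the authors used, and the final generation call to a hosted Llama 2 13B model is left as a placeholder:

```python
# Simplified RAG sketch: retrieve the most relevant note passages, then
# build a zero-shot prompt that grounds the model in that context.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

notes = [
    "Client reports reduced appetite and 4 kg weight loss over 3 months.",
    "Dietitian review: oral nutritional supplement commenced twice daily.",
    "Physiotherapy session completed; mobility unchanged.",
]
question = "Summarize the client's nutritional status and malnutrition risk factors."

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(notes)
scores = cosine_similarity(vectorizer.transform([question]), doc_vectors)[0]
top_passages = [notes[i] for i in scores.argsort()[::-1][:2]]  # two best matches

prompt = (
    "Using only the context below, answer the question.\n"
    "Context:\n" + "\n".join(top_passages) + f"\nQuestion: {question}"
)
# response = generate(prompt)  # placeholder: call to a hosted Llama 2 13B model
print(prompt)
```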
Affiliation(s)
- Mohammad Alkhalaf
- School of Computing and Information Technology, University of Wollongong, Wollongong, NSW 2522, Australia; School of Computer Science, Qassim University, Qassim 51452, Saudi Arabia
- Ping Yu
- School of Computing and Information Technology, University of Wollongong, Wollongong, NSW 2522, Australia
- Mengyang Yin
- Opal Healthcare, Level 11/420 George St, Sydney NSW 2000, Australia
- Chao Deng
- School of Medical, Indigenous and Health Sciences, University of Wollongong, Wollongong, NSW 2522, Australia
6
Puerto Nino AK, Garcia Perez V, Secco S, De Nunzio C, Lombardo R, Tikkinen KAO, Elterman DS. Can ChatGPT provide high-quality patient information on male lower urinary tract symptoms suggestive of benign prostate enlargement? Prostate Cancer Prostatic Dis 2024. PMID: 38871841; DOI: 10.1038/s41391-024-00847-7.
Abstract
BACKGROUND ChatGPT has recently emerged as a novel resource for patients' disease-specific inquiries. There is, however, limited evidence assessing the quality of the information. We evaluated the accuracy and quality of ChatGPT's responses on male lower urinary tract symptoms (LUTS) suggestive of benign prostate enlargement (BPE) when compared to two reference resources. METHODS Using patient information websites from the European Association of Urology and the American Urological Association as reference material, we formulated 88 BPE-centric questions for ChatGPT 4.0+. Independently and in duplicate, we compared ChatGPT's responses against the reference material, calculating accuracy through F1 score, precision, and recall metrics. We used a 5-point Likert scale for quality rating. We evaluated examiner agreement using the interclass correlation coefficient and assessed the difference in quality scores with the Wilcoxon signed-rank test. RESULTS ChatGPT addressed all (88/88) LUTS/BPE-related questions. Across the 88 questions, the recorded F1 score was 0.79 (range: 0-1), precision 0.66 (range: 0-1), recall 0.97 (range: 0-1), and the quality score had a median of 4 (range: 1-5). Examiners had a good level of agreement (ICC = 0.86). We found no statistically significant difference between the scores given by the examiners and the overall quality of the responses (p = 0.72). DISCUSSION ChatGPT demonstrated potential utility in educating patients about BPE/LUTS, its prognosis, and treatment, supporting the decision-making process. One must exercise prudence when recommending it as the sole information outlet. Additional studies are needed to fully understand the extent of AI's efficacy in delivering patient education in urology.
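Since the abstract computes F1, precision, and recall against reference material without specifying the unit of comparison, one hedged reading is fact-level overlap between an answer and the reference. A sketch with invented fact sets:

```python
# Per-question precision, recall, and F1 computed from the overlap between
# facts stated in an answer and facts expected by the reference material.
# Both fact sets below are invented for illustration.
reference_facts = {"alpha-blockers", "watchful waiting", "TURP", "PSA testing"}
chatgpt_facts   = {"alpha-blockers", "watchful waiting", "TURP", "herbal remedies"}

true_positives = len(reference_facts & chatgpt_facts)
precision = true_positives / len(chatgpt_facts)    # correct facts / facts stated
recall    = true_positives / len(reference_facts)  # correct facts / facts expected
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f}, recall={recall:.2f}, F1={f1:.2f}")
```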
Affiliation(s)
- Angie K Puerto Nino
- Faculty of Medicine, University of Helsinki, Helsinki, Finland.
- Division of Urology, Department of Surgery, University of Toronto, Toronto, ON, Canada.
- Silvia Secco
- Department of Urology, Niguarda Hospital, Milan, Italy
- Cosimo De Nunzio
- Urology Unit, Ospedale Sant'Andrea, La Sapienza University of Rome, Rome, Italy
- Riccardo Lombardo
- Urology Unit, Ospedale Sant'Andrea, La Sapienza University of Rome, Rome, Italy
- Kari A O Tikkinen
- Faculty of Medicine, University of Helsinki, Helsinki, Finland
- Department of Urology, University of Helsinki and Helsinki University Hospital, Helsinki, Finland
- Department of Surgery, South Karelian Central Hospital, Lappeenranta, Finland
- Department of Health Research Methods, Evidence and Impact, McMaster University, Hamilton, ON, Canada
- Dean S Elterman
- Division of Urology, Department of Surgery, University of Toronto, Toronto, ON, Canada
7
Wu Y, Wu M, Wang C, Lin J, Liu J, Liu S. Evaluating the Prevalence of Burnout Among Health Care Professionals Related to Electronic Health Record Use: Systematic Review and Meta-Analysis. JMIR Med Inform 2024; 12:e54811. PMID: 38865188; DOI: 10.2196/54811.
Abstract
BACKGROUND Burnout among health care professionals is a significant concern, with detrimental effects on health care service quality and patient outcomes. The use of the electronic health record (EHR) system has been identified as a significant contributor to burnout among health care professionals. OBJECTIVE This systematic review and meta-analysis aims to assess the prevalence of burnout among health care professionals associated with the use of the EHR system, thereby providing evidence to improve health information systems and develop strategies to measure and mitigate burnout. METHODS We conducted a comprehensive search of the PubMed, Embase, and Web of Science databases for English-language peer-reviewed articles published between January 1, 2009, and December 31, 2022. Two independent reviewers applied inclusion and exclusion criteria, and study quality was assessed using the Joanna Briggs Institute checklist and the Newcastle-Ottawa Scale. Meta-analyses were performed using R (version 4.1.3; R Foundation for Statistical Computing), with EndNote X7 (Clarivate) for reference management. RESULTS The review included 32 cross-sectional studies and 5 case-control studies with a total of 66,556 participants, mainly physicians and registered nurses. The pooled prevalence of burnout among health care professionals in cross-sectional studies was 40.4% (95% CI 37.5%-43.2%). Case-control studies indicated a higher likelihood of burnout among health care professionals who spent more time on EHR-related tasks outside work (odds ratio 2.43, 95% CI 2.31-2.57). CONCLUSIONS The findings highlight the association between the increased use of the EHR system and burnout among health care professionals. Potential solutions include optimizing EHR systems, implementing automated dictation or note-taking, employing scribes to reduce documentation burden, and leveraging artificial intelligence to enhance EHR system efficiency and reduce the risk of burnout. TRIAL REGISTRATION PROSPERO International Prospective Register of Systematic Reviews CRD42021281173; https://www.crd.york.ac.uk/prospero/display_record.php?ID=CRD42021281173.
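The pooled prevalence above comes from a meta-analysis of study-level proportions; the review used R, and the exact pooling model is not stated in the abstract. An illustrative fixed-effect, inverse-variance pooling on the logit scale, with invented study counts:

```python
# Inverse-variance pooling of burnout prevalence across studies (fixed-effect,
# logit scale). A simplified stand-in for the review's R-based meta-analysis.
import math

studies = [(120, 300), (45, 150), (260, 700)]  # (burnout cases, sample size)

weights, logits = [], []
for cases, n in studies:
    p = cases / n
    logits.append(math.log(p / (1 - p)))
    var = 1 / (n * p * (1 - p))  # variance of the logit-transformed proportion
    weights.append(1 / var)

pooled_logit = sum(w * l for w, l in zip(weights, logits)) / sum(weights)
pooled_prev = 1 / (1 + math.exp(-pooled_logit))
print(f"Pooled prevalence: {pooled_prev:.1%}")
```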
Affiliation(s)
- Yuxuan Wu
- Department of Medical Informatics, West China Hospital, Sichuan University, Chengdu, China
- Mingyue Wu
- Information Center, West China Hospital, Sichuan University, Chengdu, China
- Changyu Wang
- West China College of Stomatology, Sichuan University, Chengdu, China
- Jie Lin
- Department of Oral Implantology, West China Hospital of Stomatology, Sichuan University, Chengdu, China
- Jialin Liu
- Department of Medical Informatics, West China Hospital, Sichuan University, Chengdu, China
- Information Center, West China Hospital, Sichuan University, Chengdu, China
- Siru Liu
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, United States
8
Elwyn G, Ryan P, Blumkin D, Weeks WB. Meet generative AI… your new shared decision-making assistant. BMJ Evid Based Med 2024. PMID: 38866469; DOI: 10.1136/bmjebm-2023-112651.
Affiliation(s)
- Glyn Elwyn
- The Dartmouth Institute for Health Policy and Clinical Practice, Dartmouth College, Hanover, New Hampshire, USA
9
Moura L, Jones DT, Sheikh IS, Murphy S, Kalfin M, Kummer BR, Weathers AL, Grinspan ZM, Silsbee HM, Jones LK, Patel AD. Implications of Large Language Models for Quality and Efficiency of Neurologic Care: Emerging Issues in Neurology. Neurology 2024; 102:e209497. PMID: 38759131; DOI: 10.1212/wnl.0000000000209497.
Abstract
Large language models (LLMs) are advanced artificial intelligence (AI) systems that excel in recognizing and generating human-like language, possibly serving as valuable tools for neurology-related information tasks. Although LLMs have shown remarkable potential in various areas, their performance in the dynamic environment of daily clinical practice remains uncertain. This article outlines multiple limitations and challenges of using LLMs in clinical settings that need to be addressed, including limited clinical reasoning, variable reliability and accuracy, reproducibility bias, self-serving bias, sponsorship bias, and potential for exacerbating health care disparities. These challenges are further compounded by practical business considerations and infrastructure requirements, including associated costs. To overcome these hurdles and harness the potential of LLMs effectively, this article includes considerations for health care organizations, researchers, and neurologists contemplating the use of LLMs in clinical practice. It is essential for health care organizations to cultivate a culture that welcomes AI solutions and aligns them seamlessly with health care operations. Clear objectives and business plans should guide the selection of AI solutions, ensuring they meet organizational needs and budget considerations. Engaging both clinical and nonclinical stakeholders can help secure necessary resources, foster trust, and ensure the long-term sustainability of AI implementations. Testing, validation, training, and ongoing monitoring are pivotal for successful integration. For neurologists, safeguarding patient data privacy is paramount. Seeking guidance from institutional information technology resources for informed, compliant decisions, and remaining vigilant against biases in LLM outputs are essential practices in responsible and unbiased utilization of AI tools. In research, obtaining institutional review board approval is crucial when dealing with patient data, even if deidentified, to ensure ethical use. Compliance with established guidelines like SPIRIT-AI, MI-CLAIM, and CONSORT-AI is necessary to maintain consistency and mitigate biases in AI research. In summary, the integration of LLMs into clinical neurology offers immense promise while presenting formidable challenges. Awareness of these considerations is vital for harnessing the potential of AI in neurologic care effectively and enhancing patient care quality and safety. The article serves as a guide for health care organizations, researchers, and neurologists navigating this transformative landscape.
Affiliation(s)
- Lidia Moura, David T Jones, Irfan S Sheikh, Shawn Murphy, Michael Kalfin, Benjamin R Kummer, Allison L Weathers, Zachary M Grinspan, Heather M Silsbee, Lyell K Jones, Anup D Patel
- From the Center for Value-based Health Care and Sciences (L.M.), and Department of Neurology (L.M., S.M.), Massachusetts General Hospital, Boston; Harvard Medical School (L.M., S.M.), Boston, MA; Department of Neurology (D.T.J., L.K.J.), Mayo Clinic, Rochester, MN; Department of Neurology (I.S.S.), University of Texas Southwestern Medical Center, Dallas; Department of Neurology (M.K.), University of Pennsylvania Health System, Philadelphia; Department of Neurology (B.R.K.), Icahn School of Medicine at Mount Sinai, New York, NY; Information Technology Division (A.L.W.), Cleveland Clinic, OH; Department of Pediatrics (Z.M.G.), Weill Cornell Medicine, New York, NY; American Academy of Neurology (H.M.S.), Minneapolis, MN; and The Center for Clinical Excellence (A.D.P.), Nationwide Children's Hospital, Division of Neurology, The Ohio State University College of Medicine, Columbus
10
Roldan-Vasquez E, Mitri S, Bhasin S, Bharani T, Capasso K, Haslinger M, Sharma R, James TA. Reliability of artificial intelligence chatbot responses to frequently asked questions in breast surgical oncology. J Surg Oncol 2024. PMID: 38837375; DOI: 10.1002/jso.27715.
Abstract
INTRODUCTION Artificial intelligence (AI)-driven chatbots, capable of simulating human-like conversations, are becoming more prevalent in healthcare. While this technology offers potential benefits in patient engagement and information accessibility, it raises concerns about potential misuse, misinformation, inaccuracies, and ethical challenges. METHODS This study evaluated a publicly available AI chatbot, ChatGPT, on its responses to nine questions related to breast cancer surgery selected from the American Society of Breast Surgeons' frequently asked questions (FAQ) patient education website. Four breast surgical oncologists assessed the responses for accuracy and reliability using a five-point Likert scale and the Patient Education Materials Assessment Tool (PEMAT). RESULTS The average reliability score for ChatGPT in answering breast cancer surgery questions was 3.98 out of 5.00. Surgeons unanimously found the responses understandable and actionable per the PEMAT criteria. The consensus found ChatGPT's overall performance was appropriate, with minor or no inaccuracies. CONCLUSION ChatGPT demonstrates good reliability in responding to breast cancer surgery queries, with minor, nonharmful inaccuracies. Its answers are accurate, clear, and easy to comprehend. Notably, ChatGPT acknowledged its informational role and did not attempt to replace medical advice or discourage users from seeking input from a healthcare professional.
Affiliation(s)
- Estefania Roldan-Vasquez
- Department of Surgery, Breast Surgical Oncology, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, Massachusetts, USA
- Samir Mitri
- Department of Surgery, Breast Surgical Oncology, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, Massachusetts, USA
- Shreya Bhasin
- Department of Surgery, Breast Surgical Oncology, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, Massachusetts, USA
- School of Medicine and Dentistry, University of Rochester, Rochester, New York, USA
- Tina Bharani
- Department of Surgery, Breast Surgical Oncology, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, Massachusetts, USA
- Department of Surgery, Brigham and Women's Hospital, Harvard Medical School, Boston, Massachusetts, USA
- Kathryn Capasso
- Department of Surgery, Breast Surgical Oncology, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, Massachusetts, USA
- Michelle Haslinger
- Department of Surgery, Breast Surgical Oncology, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, Massachusetts, USA
- Ranjna Sharma
- Department of Surgery, Breast Surgical Oncology, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, Massachusetts, USA
- Ted A James
- Department of Surgery, Breast Surgical Oncology, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, Massachusetts, USA
11
Harada Y, Suzuki T, Harada T, Sakamoto T, Ishizuka K, Miyagami T, Kawamura R, Kunitomo K, Nagano H, Shimizu T, Watari T. Performance evaluation of ChatGPT in detecting diagnostic errors and their contributing factors: an analysis of 545 case reports of diagnostic errors. BMJ Open Qual 2024; 13:e002654. PMID: 38830730; PMCID: PMC11149143; DOI: 10.1136/bmjoq-2023-002654.
Abstract
BACKGROUND Manual chart review using validated assessment tools is a standardised methodology for detecting diagnostic errors. However, this requires considerable human resources and time. ChatGPT, a recently developed artificial intelligence chatbot based on a large language model, can effectively classify text based on suitable prompts. Therefore, ChatGPT can assist manual chart reviews in detecting diagnostic errors. OBJECTIVE This study aimed to clarify whether ChatGPT could correctly detect diagnostic errors and possible factors contributing to them based on case presentations. METHODS We analysed 545 published case reports that included diagnostic errors. We input the texts of the case presentations and the final diagnoses, with some original prompts, into ChatGPT (GPT-4) to generate responses, including the judgement of diagnostic errors and the factors contributing to them. Factors contributing to diagnostic errors were coded according to the following three taxonomies: Diagnosis Error Evaluation and Research (DEER), Reliable Diagnosis Challenges (RDC) and Generic Diagnostic Pitfalls (GDP). The responses on the contributing factors from ChatGPT were compared with those from physicians. RESULTS ChatGPT correctly detected diagnostic errors in 519/545 cases (95%) and coded statistically larger numbers of factors contributing to diagnostic errors per case than physicians: DEER (median 5 vs 1, p<0.001), RDC (median 4 vs 2, p<0.001) and GDP (median 4 vs 1, p<0.001). The most important contributing factors of diagnostic errors coded by ChatGPT were 'failure/delay in considering the diagnosis' (315, 57.8%) in DEER, 'atypical presentation' (365, 67.0%) in RDC, and 'atypical presentation' (264, 48.4%) in GDP. CONCLUSION ChatGPT accurately detects diagnostic errors from case presentations. ChatGPT may be more sensitive than manual reviewing in detecting factors contributing to diagnostic errors, especially 'atypical presentation'.
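The workflow described above (feeding a case presentation, a final diagnosis, and a prompt into GPT-4) might look like the following sketch using the OpenAI Python client (v1.x). The prompt wording and case text are illustrative, not the authors' original materials:

```python
# Prompting GPT-4 to judge whether a diagnostic error occurred and to code
# contributing factors against a named taxonomy.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

case_presentation = "A 58-year-old man presented with acute back pain..."  # shortened example
final_diagnosis = "Aortic dissection"

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{
        "role": "user",
        "content": (
            "Given the case presentation and final diagnosis below, state "
            "whether a diagnostic error occurred, then list contributing "
            "factors using the DEER taxonomy.\n\n"
            f"Case: {case_presentation}\nFinal diagnosis: {final_diagnosis}"
        ),
    }],
)
print(response.choices[0].message.content)
```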
Affiliation(s)
- Yukinori Harada
- Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Shimotsuga-gun, Tochigi, Japan
- Taku Harada
- Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Shimotsuga-gun, Tochigi, Japan
- Nerima Hikarigaoka Hospital, Nerima-ku, Tokyo, Japan
- Tetsu Sakamoto
- Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Shimotsuga-gun, Tochigi, Japan
- Kosuke Ishizuka
- Yokohama City University School of Medicine Graduate School of Medicine, Yokohama, Kanagawa, Japan
- Taiju Miyagami
- Department of General Medicine, Faculty of Medicine, Juntendo University, Bunkyo-ku, Tokyo, Japan
- Ren Kawamura
- Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Shimotsuga-gun, Tochigi, Japan
- Hiroyuki Nagano
- Department of General Internal Medicine, Tenri Hospital, Tenri, Nara, Japan
- Taro Shimizu
- Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Shimotsuga-gun, Tochigi, Japan
- Takashi Watari
- Integrated Clinical Education Center, Kyoto University Hospital, Kyoto, Kyoto, Japan
12
Balasanjeevi G, Surapaneni KM. Comparison of ChatGPT version 3.5 & 4 for utility in respiratory medicine education using clinical case scenarios. Respir Med Res 2024; 85:101091. PMID: 38657295; DOI: 10.1016/j.resmer.2024.101091.
Abstract
Integration of ChatGPT in respiratory medicine presents a promising avenue for enhancing clinical practice and pedagogical approaches. This study compares the performance of ChatGPT versions 3.5 and 4 in respiratory medicine, emphasizing their potential in clinical decision support and medical education using clinical cases. Results indicate moderate performance, highlighting limitations in handling complex case scenarios. Compared to ChatGPT 3.5, version 4 showed greater promise as a pedagogical tool, providing interactive learning experiences. While it can serve as a preliminary clinical decision support tool, caution is advised, stressing the need for ongoing validation. Future research should refine its clinical capabilities for optimal integration into medical education and practice.
Affiliation(s)
- Gayathri Balasanjeevi
- Department of Tuberculosis & Respiratory Diseases, Panimalar Medical College Hospital & Research Institute, Varadharajapuram, Poonamallee, Chennai 600 123, Tamil Nadu, India
- Krishna Mohan Surapaneni
- Department of Biochemistry, Panimalar Medical College Hospital & Research Institute, Varadharajapuram, Poonamallee, Chennai, 600 123 Tamil Nadu, India; Department of Medical Education, Panimalar Medical College Hospital & Research Institute, Varadharajapuram, Poonamallee, Chennai, 600 123 Tamil Nadu, India
13
Shen SA, Perez-Heydrich CA, Xie DX, Nellis JC. ChatGPT vs. web search for patient questions: what does ChatGPT do better? Eur Arch Otorhinolaryngol 2024; 281:3219-3225. PMID: 38416195; DOI: 10.1007/s00405-024-08524-0.
Abstract
PURPOSE Chat generative pretrained transformer (ChatGPT) has the potential to significantly impact how patients acquire medical information online. Here, we characterize the readability and appropriateness of ChatGPT responses to a range of patient questions compared to results from traditional web searches. METHODS Patient questions related to the published Clinical Practice Guidelines by the American Academy of Otolaryngology-Head and Neck Surgery were sourced from existing online posts. Questions were categorized using a modified Rothwell classification system into (1) fact, (2) policy, and (3) diagnosis and recommendations. These were queried using ChatGPT and traditional web search. All results were evaluated on readability (Flesch Reading Ease and Flesch-Kincaid Grade Level) and understandability (Patient Education Materials Assessment Tool). Accuracy was assessed by two blinded clinical evaluators using a three-point ordinal scale. RESULTS 54 questions were organized into fact (37.0%), policy (37.0%), and diagnosis (25.8%). The average readability for ChatGPT responses was lower than traditional web search (FRE: 42.3 ± 13.1 vs. 55.6 ± 10.5, p < 0.001), while PEMAT understandability was equivalent (93.8% vs. 93.5%, p = 0.17). ChatGPT scored higher than web search for questions in the 'Diagnosis' category (p < 0.01); there was no difference for questions categorized as 'Fact' (p = 0.15) or 'Policy' (p = 0.22). Additional prompting improved ChatGPT response readability (FRE 55.6 ± 13.6, p < 0.01). CONCLUSIONS ChatGPT outperforms web search in answering patient questions related to symptom-based diagnoses and is equivalent in providing medical facts and established policy. Appropriate prompting can further improve readability while maintaining accuracy. Further patient education is needed to relay the benefits and limitations of this technology as a source of medical information.
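Both readability formulas named above are published constants over words-per-sentence and syllables-per-word. A sketch of each, using a crude vowel-group heuristic for syllables (validated tools count syllables more carefully, so treat the output as an estimate):

```python
# Flesch Reading Ease (FRE) and Flesch-Kincaid Grade Level (FKGL) from
# approximate sentence, word, and syllable counts.
import re

def count_syllables(word: str) -> int:
    # Vowel-group heuristic: each run of vowels counts as one syllable.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability(text: str) -> tuple[float, float]:
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    wps, spw = len(words) / sentences, syllables / len(words)
    fre = 206.835 - 1.015 * wps - 84.6 * spw   # Flesch Reading Ease
    fkgl = 0.39 * wps + 11.8 * spw - 15.59     # Flesch-Kincaid Grade Level
    return fre, fkgl

print(readability("Tonsillitis is an infection of the tonsils. It often causes a sore throat."))
```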
Affiliation(s)
- Sarek A Shen
- Department of Otolaryngology-Head and Neck Surgery, Johns Hopkins School of Medicine, 601 North Caroline Street, Baltimore, MD, 21287, USA.
- Deborah X Xie
- Department of Otolaryngology-Head and Neck Surgery, Johns Hopkins School of Medicine, 601 North Caroline Street, Baltimore, MD, 21287, USA
- Jason C Nellis
- Department of Otolaryngology-Head and Neck Surgery, Johns Hopkins School of Medicine, 601 North Caroline Street, Baltimore, MD, 21287, USA
14
Koga S, Du W. Integrating AI in medicine: Lessons from Chat-GPT's limitations in medical imaging. Dig Liver Dis 2024; 56:1114-1115. PMID: 38429138; DOI: 10.1016/j.dld.2024.02.014.
Affiliation(s)
- Shunsuke Koga
- Department of Pathology and Laboratory Medicine, Hospital of the University of Pennsylvania, 3400 Spruce Street, Philadelphia, PA 19104, United States.
- Wei Du
- Department of Pathology and Laboratory Medicine, Hospital of the University of Pennsylvania, 3400 Spruce Street, Philadelphia, PA 19104, United States
15
Baldwin AJ. An artificial intelligence language model improves readability of burns first aid information. Burns 2024; 50:1122-1127. PMID: 38492982; DOI: 10.1016/j.burns.2024.03.005.
Abstract
AIMS This study aimed to assess the potential of using an artificial intelligence (AI) large language model to improve the readability of burns first aid information. METHODS An AI language model (ChatGPT-3) was used to rewrite content from the top 50 English-language webpages containing burns first aid information so that it would be understandable by an individual with the literacy level of an 11-year-old, as recommended by the American Medical Association and Health Education England. Readability was assessed using five validated tools. RESULTS In their original form, only 4% of the patient education materials (PEMs) met the target readability level across all tools. The median grade was 6.9 (SD=1.1); a one-sample one-tailed t-test revealed that this was not significantly below the target (p = .31). After AI modification, 18% of PEMs reached the target level using all tools, with a median grade of 6 (SD=0.9), which was significantly below the target level (p < .001). A paired t-test demonstrated that all readability scores improved significantly once rewritten using AI (p < .001). CONCLUSION Utilising an AI language model proved an effective and viable method for enhancing the readability of burns first aid information.
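The before/after comparison above is a paired test on readability grades for the same pages pre- and post-rewriting. A sketch with invented grade levels:

```python
# Paired t-test on readability grade levels for the same webpages before
# and after AI rewriting. Values are invented for illustration.
from scipy.stats import ttest_rel

original_grades  = [7.8, 6.9, 8.2, 7.1, 6.5, 9.0]
rewritten_grades = [6.2, 5.8, 6.9, 6.0, 5.5, 7.1]

t_stat, p_value = ttest_rel(original_grades, rewritten_grades)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")  # a significant drop means improved readability
```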
Affiliation(s)
- Alexander J Baldwin
- Department of Burns and Plastic Surgery, Buckinghamshire Healthcare NHS Trust, Buckinghamshire, UK.
16
Lee VV, van der Lubbe SCC, Goh LH, Valderas JM. Harnessing ChatGPT for Thematic Analysis: Are We Ready? J Med Internet Res 2024; 26:e54974. PMID: 38819896; PMCID: PMC11179012; DOI: 10.2196/54974.
Abstract
ChatGPT (OpenAI) is an advanced natural language processing tool with growing applications across various disciplines in medical research. Thematic analysis, a qualitative research method to identify and interpret patterns in data, is one application that stands to benefit from this technology. This viewpoint explores the use of ChatGPT in three core phases of thematic analysis within a medical context: (1) direct coding of transcripts, (2) generating themes from a predefined list of codes, and (3) preprocessing quotes for manuscript inclusion. Additionally, we explore the potential of ChatGPT to generate interview transcripts, which may be used for training purposes. We assess the strengths and limitations of using ChatGPT in these roles, highlighting areas where human intervention remains necessary. Overall, we argue that ChatGPT can function as a valuable tool during analysis, enhancing the efficiency of the thematic analysis and offering additional insights into the qualitative data. While ChatGPT may not adequately capture the full context of each participant, it can serve as an additional member of the analysis team, contributing to researcher triangulation through knowledge building and sensemaking.
Affiliation(s)
- V Vien Lee
- Division of Family Medicine, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore
- Stephanie C C van der Lubbe
- Division of Family Medicine, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore
- Lay Hoon Goh
- Division of Family Medicine, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore
- Department of Family Medicine, National University Health System, Singapore, Singapore
- Jose Maria Valderas
- Division of Family Medicine, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore
- Department of Family Medicine, National University Health System, Singapore, Singapore
- Centre for Research in Health Systems Performance, National University of Singapore, Singapore, Singapore
17
Buhr CR, Smith H, Huppertz T, Bahr-Hamm K, Matthias C, Cuny C, Snijders JP, Ernst BP, Blaikie A, Kelsey T, Kuhn S, Eckrich J. Assessing unknown potential-quality and limitations of different large language models in the field of otorhinolaryngology. Acta Otolaryngol 2024:1-6. DOI: 10.1080/00016489.2024.2352843.
Abstract
BACKGROUND Large Language Models (LLMs) might offer a solution for the lack of trained health personnel, particularly in low- and middle-income countries. However, their strengths and weaknesses remain unclear. AIMS/OBJECTIVES Here we benchmark different LLMs (Bard 2023.07.13, Claude 2, ChatGPT 4) against six consultants in otorhinolaryngology (ORL). MATERIAL AND METHODS Case-based questions were extracted from the literature and German state examinations. Answers from Bard 2023.07.13, Claude 2, ChatGPT 4, and six ORL consultants were rated blindly on a 6-point Likert scale for medical adequacy, comprehensibility, coherence, and conciseness. The given answers were compared to validated answers and evaluated for hazards. A modified Turing test was performed and character counts were compared. RESULTS The LLMs' answers ranked inferior to the consultants' in all categories. Yet, the difference between consultants and LLMs was marginal, with the clearest disparity in conciseness and the smallest in comprehensibility. Among the LLMs, Claude 2 was rated best in medical adequacy and conciseness. Consultants' answers matched the validated solution in 93% (228/246), ChatGPT 4 in 85% (35/41), Claude 2 in 78% (32/41), and Bard 2023.07.13 in 59% (24/41). Answers were rated as potentially hazardous in 10% (24/246) for ChatGPT 4, 14% (34/246) for Claude 2, 19% (46/264) for Bard 2023.07.13, and 6% (71/1230) for consultants. CONCLUSIONS AND SIGNIFICANCE Despite the consultants' superior performance, LLMs show potential for clinical application in ORL. Future studies should assess their performance on a larger scale.
Affiliation(s)
- Christoph R Buhr
- Department of Otorhinolaryngology, University Medical Center of the Johannes Gutenberg-University Mainz, Mainz, Germany
- School of Medicine, University of St Andrews, St Andrews, UK
- Harry Smith
- School of Computer Science, University of St Andrews, St Andrews, UK
- Tilman Huppertz
- Department of Otorhinolaryngology, University Medical Center of the Johannes Gutenberg-University Mainz, Mainz, Germany
- Katharina Bahr-Hamm
- Department of Otorhinolaryngology, University Medical Center of the Johannes Gutenberg-University Mainz, Mainz, Germany
- Christoph Matthias
- Department of Otorhinolaryngology, University Medical Center of the Johannes Gutenberg-University Mainz, Mainz, Germany
- Clemens Cuny
- Outpatient Clinic, Clemens Cuny, Dieburg, Germany
- Andrew Blaikie
- School of Medicine, University of St Andrews, St Andrews, UK
- Tom Kelsey
- School of Computer Science, University of St Andrews, St Andrews, UK
- Sebastian Kuhn
- Institute for Digital Medicine, Philipps-University Marburg, University Hospital of Giessen and Marburg, Marburg, Germany
- Jonas Eckrich
- Department of Otorhinolaryngology, University Medical Center of the Johannes Gutenberg-University Mainz, Mainz, Germany
18
Liu S, McCoy AB, Wright AP, Nelson SD, Huang SS, Ahmad HB, Carro SE, Franklin J, Brogan J, Wright A. Why do users override alerts? Utilizing large language model to summarize comments and optimize clinical decision support. J Am Med Inform Assoc 2024; 31:1388-1396. PMID: 38452289; PMCID: PMC11105133; DOI: 10.1093/jamia/ocae041.
Abstract
OBJECTIVES To evaluate the capability of generative artificial intelligence (AI) in summarizing alert comments and to determine whether the AI-generated summaries could be used to improve clinical decision support (CDS) alerts. MATERIALS AND METHODS We extracted user comments on alerts generated from September 1, 2022 to September 1, 2023 at Vanderbilt University Medical Center. For a subset of 8 alerts, comment summaries were generated independently by 2 physicians and then separately by GPT-4. We surveyed 5 CDS experts to rate the human-generated and AI-generated summaries on a scale from 1 (strongly disagree) to 5 (strongly agree) across 4 metrics: clarity, completeness, accuracy, and usefulness. RESULTS Five CDS experts participated in the survey. A total of 16 human-generated summaries and 8 AI-generated summaries were assessed. Five of the top 8 rated summaries were generated by GPT-4. AI-generated summaries demonstrated high levels of clarity, accuracy, and usefulness, similar to the human-generated summaries. Moreover, AI-generated summaries exhibited significantly higher completeness and usefulness than the human-generated summaries (AI: 3.4 ± 1.2, human: 2.7 ± 1.2, P = .001). CONCLUSION End-user comments provide clinicians' immediate feedback on CDS alerts and can serve as a direct and valuable data resource for improving CDS delivery. Traditionally, these comments may not be considered in the CDS review process due to their unstructured nature, large volume, and the presence of redundant or irrelevant content. Our study demonstrates that GPT-4 is capable of distilling these comments into summaries characterized by high clarity, accuracy, and completeness. AI-generated summaries are equivalent to, and potentially better than, human-generated summaries. These AI-generated summaries could provide CDS experts with a novel means of reviewing user comments to rapidly optimize CDS alerts both online and offline.
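The study does not publish its prompt or parameters; as a hedged sketch of what the summarization step might look like with the OpenAI Python client (the prompt wording, model string, and function name are assumptions, not the authors' code):

```python
# Hypothetical sketch of distilling override comments for one CDS alert.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def summarize_comments(alert_name: str, comments: list[str]) -> str:
    joined = "\n".join(f"- {c}" for c in comments)
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "Summarize clinicians' comments on a clinical decision "
                        "support alert: main override reasons, recurring "
                        "complaints, and suggested fixes."},
            {"role": "user", "content": f"Alert: {alert_name}\nComments:\n{joined}"},
        ],
    )
    return response.choices[0].message.content
```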
Collapse
Affiliation(s)
- Siru Liu
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37212, United States
- Department of Computer Science, Vanderbilt University, Nashville, TN 37212, United States
| | - Allison B McCoy
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37212, United States
| | - Aileen P Wright
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37212, United States
- Department of Medicine, Vanderbilt University Medical Center, Nashville, TN 37212, United States
| | - Scott D Nelson
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37212, United States
| | - Sean S Huang
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37212, United States
- Department of Medicine, Vanderbilt University Medical Center, Nashville, TN 37212, United States
| | - Hasan B Ahmad
- Department of Biomedical Informatics and Medical Education, University of Washington, Seattle, WA 98195, United States
| | - Sabrina E Carro
- Department of Pediatrics, Vanderbilt University Medical Center, Nashville, TN 37212, United States
| | - Jacob Franklin
- Department of Medicine, Vanderbilt University Medical Center, Nashville, TN 37212, United States
| | - James Brogan
- Department of Medicine, Vanderbilt University Medical Center, Nashville, TN 37212, United States
| | - Adam Wright
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37212, United States
- Department of Medicine, Vanderbilt University Medical Center, Nashville, TN 37212, United States
| |
Collapse
|
19
|
Liu S, McCoy AB, Wright AP, Carew B, Genkins JZ, Huang SS, Peterson JF, Steitz B, Wright A. Leveraging large language models for generating responses to patient messages-a subjective analysis. J Am Med Inform Assoc 2024; 31:1367-1379. [PMID: 38497958 PMCID: PMC11105129 DOI: 10.1093/jamia/ocae052] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/17/2023] [Revised: 01/17/2024] [Accepted: 02/28/2024] [Indexed: 03/19/2024] Open
Abstract
OBJECTIVE This study aimed to develop and assess the performance of fine-tuned large language models for generating responses to patient messages sent via an electronic health record patient portal. MATERIALS AND METHODS Utilizing a dataset of messages and responses extracted from the patient portal at a large academic medical center, we developed a model (CLAIR-Short) based on a pre-trained large language model (LLaMA-65B). In addition, we used the OpenAI API to update physician responses from an open-source dataset into a format with informative paragraphs that offered patient education while emphasizing empathy and professionalism. By combining this dataset with the portal data, we further fine-tuned our model (CLAIR-Long). To evaluate the fine-tuned models, we used 10 representative patient portal questions in primary care to generate responses. We asked primary care physicians to review the generated responses from our models and ChatGPT and to rate them for empathy, responsiveness, accuracy, and usefulness. RESULTS The dataset consisted of 499 794 pairs of patient messages and corresponding responses from the patient portal, plus 5000 patient messages and ChatGPT-updated responses from an online platform. Four primary care physicians participated in the survey. CLAIR-Short exhibited the ability to generate concise responses similar to providers' responses. CLAIR-Long responses provided more patient educational content than CLAIR-Short and were rated similarly to ChatGPT's responses, receiving positive evaluations for responsiveness, empathy, and accuracy, and a neutral rating for usefulness. CONCLUSION This subjective analysis suggests that leveraging large language models to generate responses to patient messages has significant potential to facilitate communication between patients and healthcare providers.
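The paper's exact data schema is not given; a minimal sketch of one common way to serialize message/response pairs for instruction-style fine-tuning (field names and example text are invented):

```python
# Write portal message/response pairs as instruction-tuning JSONL records.
import json

pairs = [
    {"message": "My incision is red around the edges. Is that normal?",
     "response": "Some redness near the incision is common in the first week..."},
]

with open("portal_pairs.jsonl", "w") as f:
    for pair in pairs:
        record = {
            "instruction": "Respond to the patient's portal message with "
                           "accurate, empathetic guidance.",
            "input": pair["message"],
            "output": pair["response"],
        }
        f.write(json.dumps(record) + "\n")
```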
Collapse
Affiliation(s)
- Siru Liu
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37212, United States
| | - Allison B McCoy
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37212, United States
| | - Aileen P Wright
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37212, United States
- Department of Medicine, Vanderbilt University Medical Center, Nashville, TN 37212, United States
| | - Babatunde Carew
- Department of General Internal Medicine and Public Health, Vanderbilt University Medical Center, Nashville, TN 37212, United States
| | - Julian Z Genkins
- Department of Medicine, Stanford University, Stanford, CA 94304, United States
| | - Sean S Huang
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37212, United States
- Department of Medicine, Vanderbilt University Medical Center, Nashville, TN 37212, United States
| | - Josh F Peterson
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37212, United States
- Department of Medicine, Vanderbilt University Medical Center, Nashville, TN 37212, United States
| | - Bryan Steitz
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37212, United States
| | - Adam Wright
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37212, United States
| |
Collapse
|
20
|
Bridges JM. Computerized diagnostic decision support systems - a comparative performance study of Isabel Pro vs. ChatGPT4. Diagnosis (Berl) 2024; 0:dx-2024-0033. [PMID: 38709491 DOI: 10.1515/dx-2024-0033] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/15/2024] [Accepted: 04/22/2024] [Indexed: 05/07/2024]
Abstract
OBJECTIVES Validate the diagnostic accuracy of the artificial intelligence large language model ChatGPT4 by comparing diagnosis lists produced by ChatGPT4 to those of Isabel Pro. METHODS This study used 201 cases, comparing ChatGPT4 to Isabel Pro. Inputs to both systems were identical. Mean Reciprocal Rank (MRR) compares the rank of the correct diagnosis between systems. Isabel Pro ranks by the frequency with which the symptoms appear in the reference dataset; the mechanism ChatGPT4 uses to rank the diagnoses is unknown. A Wilcoxon signed-rank test was used to compare the systems. RESULTS Both systems produced comprehensive differential diagnosis lists. Isabel Pro's list appears immediately upon submission, while ChatGPT4 takes several minutes. Isabel Pro produced 175 (87.1%) correct diagnoses and ChatGPT4 165 (82.1%). The MRR for ChatGPT4 was 0.428 (rank 2.31) and for Isabel Pro 0.389 (rank 2.57), an average rank of approximately three for each. ChatGPT4 outperformed on Recall at Rank 1, 5, and 10, with Isabel Pro outperforming at 20, 30, and 40. The Wilcoxon signed-rank test indicated that the sample size was inadequate to conclude that the systems are equivalent. ChatGPT4 fabricated citations and DOIs, producing 145 correct references (87.9%) but only 52 correct DOIs (31.5%). CONCLUSIONS This study validates the promise of clinical diagnostic decision support systems, including the large language model form of artificial intelligence (AI). Until the issue of hallucination of references, and perhaps of diagnoses, is resolved in favor of absolute accuracy, clinicians will make cautious use of large language model systems in diagnosis, if at all.
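For readers unfamiliar with the two metrics: MRR is the mean of 1/rank of the correct diagnosis across cases, and Recall at Rank k is the fraction of cases whose correct diagnosis appears in the top k. A minimal sketch (the ranks below are invented for illustration):

```python
def mean_reciprocal_rank(ranks):
    # rank is None when the correct diagnosis never appeared in the list
    return sum(1.0 / r for r in ranks if r) / len(ranks)

def recall_at_k(ranks, k):
    return sum(1 for r in ranks if r and r <= k) / len(ranks)

ranks = [1, 3, None, 2, 1, 8]           # hypothetical per-case ranks
print(mean_reciprocal_rank(ranks))      # 0.493...
print(recall_at_k(ranks, 5))            # 0.666...
```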
Collapse
Affiliation(s)
- Joe M Bridges
- D. Bradley McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, USA
| |
Collapse
|
21
|
Makhoul M, Melkane AE, Khoury PE, Hadi CE, Matar N. A cross-sectional comparative study: ChatGPT 3.5 versus diverse levels of medical experts in the diagnosis of ENT diseases. Eur Arch Otorhinolaryngol 2024; 281:2717-2721. [PMID: 38365990 DOI: 10.1007/s00405-024-08509-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/01/2024] [Accepted: 01/24/2024] [Indexed: 02/18/2024]
Abstract
PURPOSE With recent advances in artificial intelligence (AI), it has become crucial to thoroughly evaluate its applicability in healthcare. This study aimed to assess the accuracy of ChatGPT in diagnosing ear, nose, and throat (ENT) pathology and to compare its performance to that of medical experts. METHODS We conducted a cross-sectional comparative study in which 32 ENT cases were presented to ChatGPT 3.5, ENT physicians, ENT residents, family medicine (FM) specialists, second-year medical students (Med2), and third-year medical students (Med3). Each participant provided three differential diagnoses. The study analyzed diagnostic accuracy rates and inter-rater agreement within and between participant groups and ChatGPT. RESULTS The accuracy rate of ChatGPT was 70.8%, not significantly different from that of ENT physicians or ENT residents. However, a significant difference in correctness rate existed between ChatGPT and FM specialists (49.8%, p < 0.001), and between ChatGPT and medical students (Med2 47.5%, p < 0.001; Med3 47%, p < 0.001). Inter-rater agreement for the differential diagnosis between ChatGPT and each participant group was either poor or fair. In 68.75% of cases, ChatGPT failed to mention the most critical diagnosis. CONCLUSIONS ChatGPT demonstrated accuracy comparable to that of ENT physicians and ENT residents in diagnosing ENT pathology, outperforming FM specialists, Med2, and Med3. However, it showed limitations in identifying the most critical diagnosis.
Collapse
Affiliation(s)
- Mikhael Makhoul
- Department of Otolaryngology-Head and Neck Surgery, Hotel Dieu de France Hospital, Saint Joseph University, Alfred Naccache Boulevard, Ashrafieh, PO Box: 166830, Beirut, Lebanon.
| | - Antoine E Melkane
- Department of Otolaryngology-Head and Neck Surgery, Hotel Dieu de France Hospital, Saint Joseph University, Alfred Naccache Boulevard, Ashrafieh, PO Box: 166830, Beirut, Lebanon
| | - Patrick El Khoury
- Department of Otolaryngology-Head and Neck Surgery, Hotel Dieu de France Hospital, Saint Joseph University, Alfred Naccache Boulevard, Ashrafieh, PO Box: 166830, Beirut, Lebanon
| | - Christopher El Hadi
- Department of Otolaryngology-Head and Neck Surgery, Hotel Dieu de France Hospital, Saint Joseph University, Alfred Naccache Boulevard, Ashrafieh, PO Box: 166830, Beirut, Lebanon
| | - Nayla Matar
- Department of Otolaryngology-Head and Neck Surgery, Hotel Dieu de France Hospital, Saint Joseph University, Alfred Naccache Boulevard, Ashrafieh, PO Box: 166830, Beirut, Lebanon
| |
Collapse
|
22
|
Temsah MH, Jamal A, Alhasan K, Aljamaan F, Altamimi I, Malki KH, Temsah A, Ohannessian R, Al-Eyadhy A. Transforming Virtual Healthcare: The Potentials of ChatGPT-4omni in Telemedicine. Cureus 2024; 16:e61377. [PMID: 38817799 PMCID: PMC11139454 DOI: 10.7759/cureus.61377] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 05/30/2024] [Indexed: 06/01/2024] Open
Abstract
The introduction of OpenAI's ChatGPT-4omni (GPT-4o) represents a potential advancement in virtual healthcare and telemedicine. GPT-4o excels in processing audio, visual, and textual data in real time, offering possible enhancements in understanding natural language in both English and non-English contexts. Furthermore, the new "Temporary Chat" feature may improve privacy and data confidentiality during interactions, potentially increasing integration with healthcare systems. These innovations promise to enhance communication clarity, facilitate the integration of medical images, and increase data privacy in online consultations. This editorial explores some future implications of these advancements for telemedicine, highlighting the necessity for further research on reliability and the integration of advanced language models with human expertise.
Collapse
Affiliation(s)
- Mohamad-Hani Temsah
- Pediatric Intensive Care Unit, Pediatric Department, King Saud University Medical City, College of Medicine, King Saud University, Riyadh, SAU
| | - Amr Jamal
- Family and Community Medicine, King Saud University, Riyadh, SAU
| | | | - Fadi Aljamaan
- Critical Care Department, College of Medicine, King Saud University, Riyadh, SAU
| | | | - Khalid H Malki
- Department of Otolaryngology, College of Medicine, King Saud University, Riyadh, SAU
| | - Abdulrahman Temsah
- Software Engineering Department, College of Engineering, Alfaisal University, Riyadh, SAU
| | | | - Ayman Al-Eyadhy
- Department of Pediatrics, Pediatric Intensive Care Unit, College of Medicine, King Saud University, Riyadh, SAU
- Pediatric Intensive Care Unit, King Saud University Medical City, Riyadh, SAU
| |
Collapse
|
23
|
Yilmaz Muluk S. Enhancing Musculoskeletal Injection Safety: Evaluating Checklists Generated by Artificial Intelligence and Revising the Preformed Checklist. Cureus 2024; 16:e59708. [PMID: 38841023 PMCID: PMC11150897 DOI: 10.7759/cureus.59708] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 05/02/2024] [Indexed: 06/07/2024] Open
Abstract
Background Musculoskeletal disorders are a significant global health issue, necessitating advanced management strategies such as intra-articular and extra-articular injections to alleviate pain and inflammation and to address mobility challenges. As the adoption of these interventions by physicians grows, the importance of robust safety protocols becomes paramount. This study evaluates the effectiveness of conversational artificial intelligence (AI), particularly versions 3.5 and 4 of Chat Generative Pre-trained Transformer (ChatGPT), in creating patient safety checklists for musculoskeletal injections, with the goal of enhancing the preparation of safety documentation. Methodology A quantitative analysis was conducted to evaluate AI-generated safety checklists against a preformed checklist adapted from reputable medical sources. Adherence of the generated checklists to the preformed checklist was calculated and classified. The Wilcoxon signed-rank test was used to assess performance differences between ChatGPT versions 3.5 and 4. Results ChatGPT-4 showed superior adherence to the preformed checklist compared to ChatGPT-3.5, with both versions classified as very good in safety protocol creation. Although no significant differences were present between the two versions in the sign-in and sign-out parts of the checklists, ChatGPT-4 had significantly higher scores in the procedure planning part (p = 0.007), and its overall performance was also higher (p < 0.001). Subsequently, the preformed checklist was revised to incorporate new contributions from ChatGPT. Conclusions ChatGPT, especially version 4, proved effective in generating patient safety checklists for musculoskeletal injections, highlighting the potential of AI to streamline clinical practices. Further enhancements are necessary to fully meet medical standards.
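As a hedged illustration of the paired analysis named above (the per-item adherence scores below are invented; scipy's implementation drops zero differences by default):

```python
# Wilcoxon signed-rank test on paired checklist-item scores for the
# two ChatGPT versions; a sketch, not the study's data or code.
from scipy.stats import wilcoxon

gpt35 = [3, 4, 2, 5, 3, 4, 3, 2, 4, 3]
gpt4  = [4, 4, 3, 5, 4, 5, 4, 3, 5, 4]
stat, p = wilcoxon(gpt35, gpt4)
print(f"W = {stat}, p = {p:.4f}")
```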
Collapse
|
24
|
Miao J, Thongprayoon C, Fülöp T, Cheungpasitporn W. Enhancing clinical decision-making: Optimizing ChatGPT's performance in hypertension care. J Clin Hypertens (Greenwich) 2024; 26:588-593. [PMID: 38646920 PMCID: PMC11088425 DOI: 10.1111/jch.14822] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/17/2024] [Revised: 03/27/2024] [Accepted: 03/28/2024] [Indexed: 04/23/2024]
Affiliation(s)
- Jing Miao
- Division of NephrologyDepartment of Medicine, Mayo ClinicRochesterMinnesotaUSA
| | - Charat Thongprayoon
- Division of NephrologyDepartment of Medicine, Mayo ClinicRochesterMinnesotaUSA
| | - Tibor Fülöp
- Division of NephrologyDepartment of Medicine, Medical University of South CarolinaCharlestonSouth CarolinaUSA
- Medicine ServiceRalph H. Johnson VA Medical CenterCharlestonSouth CarolinaUSA
| | | |
Collapse
|
25
|
Ferdush J, Begum M, Hossain ST. ChatGPT and Clinical Decision Support: Scope, Application, and Limitations. Ann Biomed Eng 2024; 52:1119-1124. [PMID: 37516680 DOI: 10.1007/s10439-023-03329-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/15/2023] [Accepted: 07/18/2023] [Indexed: 07/31/2023]
Abstract
This study examines ChatGPT's role in clinical decision support by analyzing its scope, application, and limitations. By analyzing patient data and providing evidence-based recommendations, ChatGPT, an AI language model, can help healthcare professionals make well-informed decisions. Applications examined include diagnosis and treatment planning. However, the study acknowledges limitations such as bias, lack of contextual understanding, and the need for human oversight, and it proposes a framework for a future clinical decision support system. Understanding these factors will allow healthcare professionals to utilize ChatGPT effectively and make accurate clinical decisions. Further research is needed to understand the implications of using ChatGPT in healthcare settings and to develop safeguards for responsible use.
Collapse
Affiliation(s)
- Jannatul Ferdush
- Department of Computer Science and Engineering, Jashore University of Science and Technology, Jashore, 7408, Bangladesh.
| | - Mahbuba Begum
- Department of Computer Science and Engineering, Mawlana Bhasani Science and Technology, Tangail, 1902, Bangladesh
| | - Sakib Tanvir Hossain
- Department of Mechanical Engineering, Khulna University of Engineering and Technology, Khulna, 9203, Bangladesh
| |
Collapse
|
26
|
Safrai M, Azaria A. Does small talk with a medical provider affect ChatGPT's medical counsel? Performance of ChatGPT on USMLE with and without distractions. PLoS One 2024; 19:e0302217. [PMID: 38687696 PMCID: PMC11060598 DOI: 10.1371/journal.pone.0302217] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/18/2023] [Accepted: 03/28/2024] [Indexed: 05/02/2024] Open
Abstract
Efforts are being made to improve the time effectiveness of healthcare providers. Artificial intelligence tools can help transcribe and summarize physician-patient encounters and produce medical notes and medical recommendations. However, in addition to medical information, discussion between healthcare providers and patients includes small talk and other information irrelevant to medical concerns. As Large Language Models (LLMs) are predictive models that build their responses from the words in the prompt, there is a risk that small talk and irrelevant information may alter the response and the suggestion given. Therefore, this study aims to investigate the impact of medical data mixed with small talk on the accuracy of medical advice provided by ChatGPT. USMLE Step 3 questions were used as a model for relevant medical data, in both multiple-choice and open-ended form. First, we gathered small talk sentences from human participants using the Mechanical Turk platform. Second, both sets of USMLE questions were arranged in a pattern where each sentence from the original question was followed by a small talk sentence. ChatGPT 3.5 and 4 were asked to answer both sets of questions with and without the small talk sentences. Finally, a board-certified physician analyzed the answers by ChatGPT and compared them to the formal correct answer. The results demonstrate that ChatGPT-3.5's ability to answer correctly was impaired when small talk was added to medical data (66.8% vs. 56.6%; p = 0.025); the impairment was not significant for multiple-choice questions (72.1% vs. 68.9%; p = 0.67) but was significant for the open-ended questions (61.5% vs. 44.3%; p = 0.01). In contrast, small talk phrases did not impair ChatGPT-4's ability on either question type (83.6% and 66.2%, respectively). According to these results, ChatGPT-4 seems more accurate than the earlier 3.5 version, and small talk does not appear to impair its capability to provide medical recommendations. Our results are an important first step in understanding the potential and limitations of utilizing ChatGPT and other LLMs for physician-patient interactions, which include casual conversations.
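The question-construction step described above is easy to picture in code; a minimal sketch (naive sentence splitting; the study's exact procedure and sentences may differ):

```python
# Follow each sentence of a medical question with one small-talk sentence.
import itertools

def interleave(question: str, small_talk: list[str]) -> str:
    sentences = [s.strip() + "." for s in question.split(".") if s.strip()]
    talk = itertools.cycle(small_talk)
    return " ".join(s + " " + next(talk) for s in sentences)

question = ("A 45-year-old man presents with chest pain. "
            "He smokes one pack of cigarettes per day.")
chatter = ["By the way, my daughter just started college.",
           "It has been raining all week, hasn't it?"]
print(interleave(question, chatter))
```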
Collapse
Affiliation(s)
- Myriam Safrai
- Department of Obstetrics and Gynecology, Chaim Sheba Medical Center (Tel Hashomer), Sackler Faculty of Medicine, Tel Aviv University, Tel Aviv, Israel
- Department of Obstetrics, Gynecology and Reproductive Sciences, Magee-Womens Research Institute, University of Pittsburgh School of Medicine, Pittsburgh, PA, United States of America
| | - Amos Azaria
- School of Computer Science, Ariel University, Ari’el, Israel
| |
Collapse
|
27
|
Wang L, Wan Z, Ni C, Song Q, Li Y, Clayton EW, Malin BA, Yin Z. A Systematic Review of ChatGPT and Other Conversational Large Language Models in Healthcare. MEDRXIV : THE PREPRINT SERVER FOR HEALTH SCIENCES 2024:2024.04.26.24306390. [PMID: 38712148 PMCID: PMC11071576 DOI: 10.1101/2024.04.26.24306390] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/08/2024]
Abstract
Background The launch of the Chat Generative Pre-trained Transformer (ChatGPT) in November 2022 has attracted public attention and academic interest to large language models (LLMs), facilitating the emergence of many other innovative LLMs. These LLMs have been applied in various fields, including healthcare. Numerous studies have since been conducted regarding how to employ state-of-the-art LLMs in health-related scenarios to assist patients, doctors, and public health administrators. Objective This review aims to summarize the applications and concerns of applying conversational LLMs in healthcare and to provide an agenda for future research on LLMs in healthcare. Methods We utilized PubMed, ACM, and IEEE digital libraries as primary sources for this review. We followed the guidance of Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) to screen and select peer-reviewed research articles that (1) were related to both healthcare applications and conversational LLMs and (2) were published before September 1st, 2023, the date when we started paper collection and screening. We investigated these papers and classified them according to their applications and concerns. Results Our search initially identified 820 papers according to targeted keywords, out of which 65 met our criteria and were included in the review. The most popular conversational LLM was ChatGPT from OpenAI (60), followed by Bard from Google (1), Large Language Model Meta AI (LLaMA) from Meta (1), and other LLMs (5). These papers were classified into four categories of applications: 1) summarization, 2) medical knowledge inquiry, 3) prediction, and 4) administration, and four categories of concerns: 1) reliability, 2) bias, 3) privacy, and 4) public acceptability. Forty-nine (75%) papers used LLMs for summarization and/or medical knowledge inquiry, and 58 (89%) expressed concerns about reliability and/or bias. We found that conversational LLMs exhibit promising results in summarization and in providing medical knowledge to patients with relatively high accuracy. However, conversational LLMs like ChatGPT are not able to provide reliable answers to complex health-related tasks that require specialized domain expertise. Additionally, none of the reviewed papers conducted experiments to thoughtfully examine how conversational LLMs lead to bias or privacy issues in healthcare research. Conclusions Future studies should focus on improving the reliability of LLM applications in complex health-related tasks, as well as on investigating the mechanisms by which LLM applications introduce bias and privacy issues. Considering the broad accessibility of LLMs, legal, social, and technical efforts are all needed to address these concerns and to promote, improve, and regulate the application of LLMs in healthcare.
Collapse
Affiliation(s)
- Leyao Wang
- Department of Computer Science, Vanderbilt University, Nashville, TN, USA, 37212
| | - Zhiyu Wan
- Department of Biomedical Informatics, Vanderbilt University Medical Center, TN, USA, 37203
| | - Congning Ni
- Department of Computer Science, Vanderbilt University, Nashville, TN, USA, 37212
| | - Qingyuan Song
- Department of Computer Science, Vanderbilt University, Nashville, TN, USA, 37212
| | - Yang Li
- Department of Computer Science, Vanderbilt University, Nashville, TN, USA, 37212
| | - Ellen Wright Clayton
- Department of Pediatrics, Vanderbilt University Medical Center, Nashville, Tennessee, USA, 37203
- Center for Biomedical Ethics and Society, Vanderbilt University Medical Center, Nashville, Tennessee, USA, 37203
| | - Bradley A. Malin
- Department of Computer Science, Vanderbilt University, Nashville, TN, USA, 37212
- Department of Biomedical Informatics, Vanderbilt University Medical Center, TN, USA, 37203
- Department of Biostatistics, Vanderbilt University Medical Center, TN, USA, 37203
| | - Zhijun Yin
- Department of Computer Science, Vanderbilt University, Nashville, TN, USA, 37212
- Department of Biomedical Informatics, Vanderbilt University Medical Center, TN, USA, 37203
| |
Collapse
|
28
|
Quttainah M, Mishra V, Madakam S, Lurie Y, Mark S. Cost, Usability, Credibility, Fairness, Accountability, Transparency, and Explainability Framework for Safe and Effective Large Language Models in Medical Education: Narrative Review and Qualitative Study. JMIR AI 2024; 3:e51834. [PMID: 38875562 PMCID: PMC11077408 DOI: 10.2196/51834] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 08/16/2023] [Revised: 12/20/2023] [Accepted: 02/03/2024] [Indexed: 06/16/2024]
Abstract
BACKGROUND The world has witnessed increased adoption of large language models (LLMs) in the last year. Although products developed using LLMs have the potential to solve accessibility and efficiency problems in health care, there is a lack of available guidelines for developing LLMs for health care, especially for medical education. OBJECTIVE The aim of this study was to identify and prioritize the enablers for developing successful LLMs for medical education, and to evaluate the relationships among these identified enablers. METHODS A narrative review of the extant literature was first performed to identify the key enablers for LLM development. We additionally gathered the opinions of LLM users to determine the relative importance of these enablers using an analytical hierarchy process (AHP), a multicriteria decision-making method. Further, total interpretive structural modeling (TISM) was used to analyze the perspectives of product developers and ascertain the relationships and hierarchy among these enablers. Finally, the cross-impact matrix-based multiplication applied to a classification (MICMAC) approach was used to determine the relative driving and dependence powers of these enablers. A nonprobabilistic purposive sampling approach was used for recruitment of focus groups. RESULTS The AHP demonstrated that the most important enabler for LLMs was credibility, with a priority weight of 0.37, followed by accountability (0.27642) and fairness (0.10572). In contrast, usability, with a priority weight of 0.04, showed negligible importance. The results of TISM concurred with the findings of the AHP. The only striking difference between expert perspectives and user preference evaluation was that the product developers indicated that cost has the least importance as a potential enabler, whereas the MICMAC analysis suggested that cost has a strong influence on the other enablers. The inputs of the focus group were found to be reliable, with a consistency ratio less than 0.1 (0.084). CONCLUSIONS This study is the first to identify, prioritize, and analyze the relationships among enablers of effective LLMs for medical education. Based on the results, we developed a comprehensible prescriptive framework, named CUC-FATE (Cost, Usability, Credibility, Fairness, Accountability, Transparency, and Explainability), for evaluating the enablers of LLMs in medical education. The findings are useful for health care professionals, health technology experts, medical technology regulators, and policy makers.
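For context on the AHP step: priority weights are the normalized principal eigenvector of a reciprocal pairwise-comparison matrix, and the consistency ratio CR = ((λmax − n)/(n − 1))/RI should stay below 0.1, as the study's 0.084 does. A minimal sketch with an invented 3 × 3 matrix (the study compared seven enablers):

```python
# AHP priority weights and consistency ratio from a pairwise matrix.
import numpy as np

A = np.array([[1,   3,   5],
              [1/3, 1,   2],
              [1/5, 1/2, 1]])       # invented judgments, not study data
n = A.shape[0]
eigvals, eigvecs = np.linalg.eig(A)
k = np.argmax(eigvals.real)
weights = np.abs(eigvecs[:, k].real)
weights /= weights.sum()            # normalized priority weights

RI = {3: 0.58, 4: 0.90, 5: 1.12, 6: 1.24, 7: 1.32}[n]  # Saaty's random index
CR = ((eigvals[k].real - n) / (n - 1)) / RI
print(weights.round(3), f"CR = {CR:.3f}")
```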
Collapse
Affiliation(s)
- Majdi Quttainah
- College of Business Administration, Kuwait University, Kuwait, Kuwait
| | - Vinaytosh Mishra
- College of Healthcare Management and Economics, Gulf Medical University, Ajman, United Arab Emirates
| | - Somayya Madakam
- Information Technology, Birla Institute of Management Technology, Knowledge Park - II, Greater Noida, India
| | - Yotam Lurie
- Department of Management, Ben-Gurion University, Negev, Israel
| | - Shlomo Mark
- Department of Software Engineering, Shamoon College of Engineering, Ashdod, Israel
| |
Collapse
|
29
|
Rosselló-Jiménez D, Docampo S, Collado Y, Cuadra-Llopart L, Riba F, Llonch-Masriera M. Geriatrics and artificial intelligence in Spain (Ger-IA project): talking to ChatGPT, a nationwide survey. Eur Geriatr Med 2024:10.1007/s41999-024-00970-7. [PMID: 38615289 DOI: 10.1007/s41999-024-00970-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/03/2023] [Accepted: 03/04/2024] [Indexed: 04/15/2024]
Abstract
PURPOSE The purposes of the study were to describe the degree of agreement between geriatricians and the answers given by an AI tool (ChatGPT) to questions covering different areas of geriatrics, to study the differences between specialists and residents in geriatrics in terms of their degree of agreement with ChatGPT, and to analyse the mean scores obtained by area of knowledge/domain. METHODS An observational study was conducted involving 126 doctors from 41 geriatric medicine departments in Spain. Ten questions about geriatric medicine were posed to ChatGPT, and doctors evaluated the AI's answers using a Likert scale. Sociodemographic variables were included. Questions were categorized into five knowledge domains, and means and standard deviations were calculated for each. RESULTS A total of 130 doctors answered the questionnaire; 126 (69.8% women; mean age 41.4 [9.8]) were included in the final analysis. The mean score obtained by ChatGPT was 3.1/5 [0.67]. Specialists rated ChatGPT lower than residents (3.0/5 vs. 3.3/5 points, respectively, P < 0.05). By domain, ChatGPT scored better on general/theoretical questions (M: 3.96; SD: 0.71) than on complex decisions/end-of-life situations (M: 2.50; SD: 0.76), and answers related to diagnosis/performance of complementary tests obtained the lowest scores (M: 2.48; SD: 0.77). CONCLUSION Scores varied considerably by area of knowledge. Questions related to theoretical aspects of challenges and the future of geriatrics obtained better scores. For complex decision-making, the appropriateness of therapeutic effort, or decisions about diagnostic tests, professionals indicated poorer performance. AI is likely to be incorporated into some areas of medicine, but it still presents important limitations, mainly in complex medical decision-making.
Collapse
Affiliation(s)
- Daniel Rosselló-Jiménez
- Geriatric Medicine Department, Hospital Universitari de Terrassa, Consorci Sanitari de Terrassa, Carr. Torrebonica, s/n, Terrassa, 08227, Barcelona, Spain.
| | - S Docampo
- Geriatric Medicine Department, Hospital Santa Creu, Tortosa, Tortosa, Tarragona, Spain
| | - Y Collado
- Geriatric Medicine Department, Hospital Universitari de Terrassa, Consorci Sanitari de Terrassa, Carr. Torrebonica, s/n, Terrassa, 08227, Barcelona, Spain
| | - L Cuadra-Llopart
- Geriatric Medicine Department, Hospital Universitari de Terrassa, Consorci Sanitari de Terrassa, Carr. Torrebonica, s/n, Terrassa, 08227, Barcelona, Spain
- Faculty of Medicine and Health Sciences, Universitat Internacional de Catalunya (UIC), Barcelona, Spain
- ACTIUM Functional Anatomy Group, Universitat Internacional de Catalunya (UIC), Barcelona, Spain
| | - F Riba
- Geriatric Medicine Department, Hospital Santa Creu, Tortosa, Tortosa, Tarragona, Spain
| | - M Llonch-Masriera
- Geriatric Medicine Department, Hospital Universitari de Terrassa, Consorci Sanitari de Terrassa, Carr. Torrebonica, s/n, Terrassa, 08227, Barcelona, Spain
- Faculty of Medicine and Health Sciences, Universitat Internacional de Catalunya (UIC), Barcelona, Spain
| |
Collapse
|
30
|
Hirosawa T, Harada Y, Tokumasu K, Ito T, Suzuki T, Shimizu T. Evaluating ChatGPT-4's Diagnostic Accuracy: Impact of Visual Data Integration. JMIR Med Inform 2024; 12:e55627. [PMID: 38592758 DOI: 10.2196/55627] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/18/2023] [Revised: 02/14/2024] [Accepted: 03/13/2024] [Indexed: 04/10/2024] Open
Abstract
BACKGROUND In the evolving field of health care, multimodal generative artificial intelligence (AI) systems, such as ChatGPT-4 with vision (ChatGPT-4V), represent a significant advancement, as they integrate visual data with text data. This integration has the potential to revolutionize clinical diagnostics by offering more comprehensive analysis capabilities. However, the impact on diagnostic accuracy of using image data to augment ChatGPT-4 remains unclear. OBJECTIVE This study aims to assess the impact of adding image data on ChatGPT-4's diagnostic accuracy and provide insights into how image data integration can enhance the accuracy of multimodal AI in medical diagnostics. Specifically, this study endeavored to compare the diagnostic accuracy between ChatGPT-4V, which processed both text and image data, and its counterpart, ChatGPT-4, which only uses text data. METHODS We identified a total of 557 case reports published in the American Journal of Case Reports from January 2022 to March 2023. After excluding cases that were nondiagnostic, pediatric, and lacking image data, we included 363 case descriptions with their final diagnoses and associated images. We compared the diagnostic accuracy of ChatGPT-4V and ChatGPT-4 without vision based on their ability to include the final diagnoses within differential diagnosis lists. Two independent physicians evaluated their accuracy, with a third resolving any discrepancies, ensuring a rigorous and objective analysis. RESULTS The integration of image data into ChatGPT-4V did not significantly enhance diagnostic accuracy, showing that final diagnoses were included in the top 10 differential diagnosis lists at a rate of 85.1% (n=309), comparable to the rate of 87.9% (n=319) for the text-only version (P=.33). Notably, ChatGPT-4V's performance in correctly identifying the top diagnosis was inferior, at 44.4% (n=161), compared with 55.9% (n=203) for the text-only version (P=.002, χ2 test). Additionally, ChatGPT-4's self-reports showed that image data accounted for 30% of the weight in developing the differential diagnosis lists in more than half of cases. CONCLUSIONS Our findings reveal that currently, ChatGPT-4V predominantly relies on textual data, limiting its ability to fully use the diagnostic potential of visual information. This study underscores the need for further development of multimodal generative AI systems to effectively integrate and use clinical image data. Enhancing the diagnostic performance of such AI systems through improved multimodal data integration could significantly benefit patient care by providing more accurate and comprehensive diagnostic insights. Future research should focus on overcoming these limitations, paving the way for the practical application of advanced AI in medicine.
Collapse
Affiliation(s)
- Takanobu Hirosawa
- Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Shimotsuga, Japan
| | - Yukinori Harada
- Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Shimotsuga, Japan
| | - Kazuki Tokumasu
- Department of General Medicine, Okayama University Graduate School of Medicine, Dentistry and Pharmaceutical Sciences, Okayama, Japan
| | | | - Tomoharu Suzuki
- Department of Hospital Medicine, Urasoe General Hospital, Okinawa, Japan
| | - Taro Shimizu
- Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Shimotsuga, Japan
| |
Collapse
|
31
|
Pavlovic ZJ, Jiang VS, Hariton E. Current applications of artificial intelligence in assisted reproductive technologies through the perspective of a patient's journey. Curr Opin Obstet Gynecol 2024:00001703-990000000-00122. [PMID: 38597425 DOI: 10.1097/gco.0000000000000951] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 04/11/2024]
Abstract
PURPOSE OF REVIEW This review highlights the timely relevance of artificial intelligence in enhancing assisted reproductive technologies (ARTs), particularly in-vitro fertilization (IVF). It underscores artificial intelligence's potential to revolutionize patient outcomes and operational efficiency by addressing challenges in fertility diagnoses and procedures. RECENT FINDINGS Recent advancements in artificial intelligence, including machine learning and predictive modeling, are making significant strides in optimizing IVF processes such as medication dosing, scheduling, and embryological assessments. Innovations include artificial intelligence-augmented diagnostic testing, predictive modeling for treatment outcomes, scheduling optimization, dosing and protocol selection, follicular and hormone monitoring, trigger timing, and improved embryo selection. These developments promise to refine treatment approaches, enhance patient engagement, and increase the accuracy and scalability of fertility treatments. SUMMARY The integration of artificial intelligence into reproductive medicine offers profound implications for clinical practice and research. By facilitating personalized treatment plans, standardizing procedures, and improving the efficiency of fertility clinics, artificial intelligence technologies pave the way for value-based, accessible, and efficient fertility services. Despite the promise, realizing the full potential of artificial intelligence in ART will require ongoing validation and ethical consideration to ensure equitable and effective implementation.
Collapse
Affiliation(s)
- Zoran J Pavlovic
- Department of Obstetrics and Gynecology/Reproductive Endocrinology and Infertility, University of South Florida, Morsani College of Medicine, Tampa, Florida
| | - Victoria S Jiang
- Division of Reproductive Endocrinology & Infertility, Vincent Department of Obstetrics and Gynecology, Massachusetts General Hospital/Harvard Medical School, Boston, Massachusetts
| | - Eduardo Hariton
- Reproductive Science Center of the San Francisco Bay Area, San Ramon, California, USA
| |
Collapse
|
32
|
Warrier A, Singh R, Haleem A, Zaki H, Eloy JA. The Comparative Diagnostic Capability of Large Language Models in Otolaryngology. Laryngoscope 2024. [PMID: 38563415 DOI: 10.1002/lary.31434] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/19/2024] [Revised: 03/05/2024] [Accepted: 03/21/2024] [Indexed: 04/04/2024]
Abstract
OBJECTIVES Evaluate and compare the ability of large language models (LLMs) to diagnose various ailments in otolaryngology. METHODS We collected all 100 clinical vignettes from the second edition of Otolaryngology Cases-The University of Cincinnati Clinical Portfolio by Pensak et al. Prefacing each vignette with the prompt "Provide a diagnosis given the following history," we asked ChatGPT-3.5, Google Bard, and Bing-GPT4 to provide a diagnosis for each vignette. These diagnoses were compared to the portfolio for accuracy and recorded. All queries were run in June 2023. RESULTS ChatGPT-3.5 was the most accurate model (89% success rate), followed by Google Bard (82%) and Bing-GPT4 (74%). A chi-squared test revealed a significant difference among the three LLMs in providing correct diagnoses (p = 0.023). Of the 100 vignettes, seven require additional testing results (i.e., biopsy, non-contrast CT) for accurate clinical diagnosis. When omitting these vignettes, the revised success rates were 95.7% for ChatGPT-3.5, 88.17% for Google Bard, and 78.72% for Bing-GPT4 (p = 0.002). CONCLUSIONS ChatGPT-3.5 offers the most accurate diagnoses when given established clinical vignettes, as compared to Google Bard and Bing-GPT4. LLMs may accurately offer assessments for common otolaryngology conditions but currently require detailed prompt information and critical supervision from clinicians. There is vast potential in the clinical applicability of LLMs; however, practitioners should be wary of possible "hallucinations" and misinformation in responses. LEVEL OF EVIDENCE 3 Laryngoscope, 2024.
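The reported p = 0.023 can be reproduced from the stated success rates (89, 82, and 74 correct out of 100 each); a minimal sketch, not the authors' code:

```python
from scipy.stats import chi2_contingency

table = [[89, 11],   # ChatGPT-3.5: correct / incorrect
         [82, 18],   # Google Bard
         [74, 26]]   # Bing-GPT4
chi2, p, dof, _ = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.3f}")   # p ≈ 0.023
```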
Collapse
Affiliation(s)
- Akshay Warrier
- Department of Otolaryngology-Head and Neck Surgery, Rutgers New Jersey Medical School, Newark, New Jersey, U.S.A
| | - Rohan Singh
- Department of Otolaryngology-Head and Neck Surgery, Rutgers New Jersey Medical School, Newark, New Jersey, U.S.A
| | - Afash Haleem
- Department of Otolaryngology-Head and Neck Surgery, Rutgers New Jersey Medical School, Newark, New Jersey, U.S.A
| | - Haider Zaki
- Department of Otolaryngology-Head and Neck Surgery, Rutgers New Jersey Medical School, Newark, New Jersey, U.S.A
| | - Jean Anderson Eloy
- Department of Otolaryngology-Head and Neck Surgery, Rutgers New Jersey Medical School, Newark, New Jersey, U.S.A
- Center for Skull Base and Pituitary Surgery, Neurological Institute of New Jersey, Rutgers New Jersey Medical School, Newark, New Jersey, U.S.A
| |
Collapse
|
33
|
Teixeira-Marques F, Medeiros N, Nazaré F, Alves S, Lima N, Ribeiro L, Gama R, Oliveira P. Exploring the role of ChatGPT in clinical decision-making in otorhinolaryngology: a ChatGPT designed study. Eur Arch Otorhinolaryngol 2024; 281:2023-2030. [PMID: 38345613 DOI: 10.1007/s00405-024-08498-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/04/2023] [Accepted: 01/23/2024] [Indexed: 03/16/2024]
Abstract
PURPOSE Since the beginning of 2023, ChatGPT has emerged as a hot topic in healthcare research. Its potential as a valuable tool in clinical practice is compelling, particularly for improving clinical decision support by helping physicians make decisions based on the best medical knowledge available. We aimed to investigate ChatGPT's ability to identify, diagnose, and manage patients with otorhinolaryngology-related symptoms. METHODS A prospective, cross-sectional study was designed, based on an idea suggested by ChatGPT, to assess the level of agreement between ChatGPT and five otorhinolaryngologists (ENTs) on 20 reality-inspired clinical cases. The clinical cases were presented to the chatbot on two different occasions (ChatGPT-1 and ChatGPT-2) to assess its temporal stability. RESULTS The mean score of ChatGPT-1 was 4.4 (SD 1.2; min 1, max 5) and of ChatGPT-2 was 4.15 (SD 1.3; min 1, max 5), while the ENTs' mean score was 4.91 (SD 0.3; min 3, max 5). The Mann-Whitney U test revealed a statistically significant difference (p < 0.001) between each ChatGPT score and the ENTs' score. ChatGPT-1 and ChatGPT-2 gave different answers on five occasions. CONCLUSIONS Artificial intelligence will be an important instrument in clinical decision-making in the near future, and ChatGPT is the most promising chatbot so far. Although further development is needed before it can be used safely, there is room for improvement and potential to aid otorhinolaryngology residents and specialists in making the most appropriate decision for the patient.
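As a hedged sketch of the unpaired comparison used above (the per-case Likert scores below are invented; the study's data are not public):

```python
# Mann-Whitney U test comparing ChatGPT's case scores with the ENTs'.
from scipy.stats import mannwhitneyu

chatgpt_scores = [5, 4, 5, 3, 5, 5, 4, 2, 5, 5]
ent_scores     = [5, 5, 5, 5, 5, 4, 5, 5, 5, 5]
U, p = mannwhitneyu(chatgpt_scores, ent_scores, alternative="two-sided")
print(f"U = {U}, p = {p:.3f}")
```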
Collapse
Affiliation(s)
- Francisco Teixeira-Marques
- Department of Otorhinolaryngology, Centro Hospitalar de Vila Nova de Gaia/Espinho, Gaia (Porto), Portugal.
| | - Nuno Medeiros
- Department of Otorhinolaryngology, Centro Hospitalar de Vila Nova de Gaia/Espinho, Gaia (Porto), Portugal
| | - Francisco Nazaré
- Department of Otorhinolaryngology, Centro Hospitalar de Vila Nova de Gaia/Espinho, Gaia (Porto), Portugal
| | - Sandra Alves
- Department of Otorhinolaryngology, Centro Hospitalar de Vila Nova de Gaia/Espinho, Gaia (Porto), Portugal
| | - Nuno Lima
- Department of Otorhinolaryngology, Centro Hospitalar de Vila Nova de Gaia/Espinho, Gaia (Porto), Portugal
| | - Leandro Ribeiro
- Department of Otorhinolaryngology, Centro Hospitalar de Vila Nova de Gaia/Espinho, Gaia (Porto), Portugal
| | - Rita Gama
- Department of Otorhinolaryngology, Centro Hospitalar de Vila Nova de Gaia/Espinho, Gaia (Porto), Portugal
| | - Pedro Oliveira
- Department of Otorhinolaryngology, Centro Hospitalar de Vila Nova de Gaia/Espinho, Gaia (Porto), Portugal
| |
Collapse
|
34
|
Li Z, Shu Y, Tang Y, Lai W. A commentary on 'Auxiliary use of ChatGPT in surgical diagnosis and treatment' (Int J Surg 109 (2023) 3940-3943). Int J Surg 2024; 110:2492-2493. [PMID: 38241341 PMCID: PMC11020027 DOI: 10.1097/js9.0000000000001126] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/06/2024] [Accepted: 01/08/2024] [Indexed: 01/21/2024]
Affiliation(s)
- Ziwei Li
- Chongqing University FuLing Hospital, Chongqing, People's Republic of China
| | | | | | | |
Collapse
|
35
|
Javid M, Bhandari M, Parameshwari P, Reddiboina M, Prasad S. Evaluation of ChatGPT for Patient Counseling in Kidney Stone Clinic: A Prospective Study. J Endourol 2024; 38:377-383. [PMID: 38411835 DOI: 10.1089/end.2023.0571] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/28/2024] Open
Abstract
Introduction: Large language models (LLMs) have the potential to improve clinical workflow and make patient care more efficient. We prospectively evaluated the performance of the LLM ChatGPT as a patient counseling tool in the urology stone clinic and validated the generated responses against those of urologists. Methods: We collected 61 questions from 12 kidney stone patients and posed them to ChatGPT and a panel of experienced urologists (Level 1). Subsequently, the blinded responses of the urologists and ChatGPT were presented to two expert urologists (Level 2) for comparative evaluation on preset domains: accuracy, relevance, empathy, completeness, and practicality. All responses were rated on a Likert scale of 1 to 10. The mean differences in the scores given by the Level 2 urologists were analyzed, and inter-rater reliability (IRR), the level of agreement between the Level 2 urologists, was assessed with Cohen's kappa. Results: The mean differences in average scores between the responses from ChatGPT and the urologists were significant for accuracy (p < 0.001), empathy (p < 0.001), completeness (p < 0.001), and practicality (p < 0.001), but not for the relevance domain (p = 0.051), with ChatGPT's responses rated higher. The IRR analysis revealed significant agreement only in the empathy domain (k = 0.163; 0.059-0.266). Conclusion: We believe the introduction of ChatGPT into the clinical workflow could further optimize the information provided to patients in a busy stone clinic. In this preliminary study, ChatGPT supplemented the answers provided by the urologists, adding value to the conversation. However, in its current state, it is not yet ready to be a direct source of authentic information for patients. We recommend its use as a source for building a comprehensive Frequently Asked Questions bank as a prelude to developing an LLM chatbot for patient counseling.
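For readers who want to see the agreement statistic in action, a minimal sketch of Cohen's kappa between two raters (labels invented and binarized purely for illustration; the study rated on a 1 to 10 scale):

```python
from sklearn.metrics import cohen_kappa_score

rater_a = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]   # hypothetical agree/disagree labels
rater_b = [1, 0, 0, 1, 1, 1, 1, 0, 0, 1]
print(f"kappa = {cohen_kappa_score(rater_a, rater_b):.3f}")
```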
Collapse
Affiliation(s)
- Mohamed Javid
- Department of Urology, Chengalpattu Medical College, Chengalpattu, Tamil Nadu, India
| | - Mahendra Bhandari
- Vattikuti Urology Institute, Henry Ford Hospital, Detroit, Michigan, USA
| | - P Parameshwari
- Department of Community Medicine, Chengalpattu Medical College, Chengalpattu, Tamil Nadu, India
| | | | - Srikala Prasad
- Department of Urology, Chengalpattu Medical College, Chengalpattu, Tamil Nadu, India
| |
Collapse
|
36
|
Caglayan A, Slusarczyk W, Rabbani RD, Ghose A, Papadopoulos V, Boussios S. Large Language Models in Oncology: Revolution or Cause for Concern? Curr Oncol 2024; 31:1817-1830. [PMID: 38668040 PMCID: PMC11049602 DOI: 10.3390/curroncol31040137] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/29/2024] [Revised: 03/13/2024] [Accepted: 03/29/2024] [Indexed: 04/28/2024] Open
Abstract
The technological capability of artificial intelligence (AI) continues to advance rapidly. Recently, the release of large language models has taken the world by storm, with concurrent excitement and concern. As a consequence of their impressive ability and versatility, they present a potential opportunity for implementation in oncology. Areas of possible application include supporting clinical decision making, education, and cancer research. Despite the promise that these novel systems offer, several limitations and barriers challenge their implementation. It is imperative that concerns such as accountability, data inaccuracy, and data protection be addressed prior to their integration into oncology. As artificial intelligence systems continue to progress, new ethical and practical dilemmas will also emerge; thus, the evaluation of these limitations and concerns will be dynamic in nature. This review offers a comprehensive overview of the potential applications of large language models in oncology, as well as the concerns surrounding their implementation in cancer care.
Collapse
Affiliation(s)
- Aydin Caglayan
- Department of Medical Oncology, Medway NHS Foundation Trust, Gillingham ME7 5NY, UK; (A.C.); (R.D.R.); (A.G.)
| | | | - Rukhshana Dina Rabbani
- Department of Medical Oncology, Medway NHS Foundation Trust, Gillingham ME7 5NY, UK; (A.C.); (R.D.R.); (A.G.)
| | - Aruni Ghose
- Department of Medical Oncology, Medway NHS Foundation Trust, Gillingham ME7 5NY, UK; (A.C.); (R.D.R.); (A.G.)
- Department of Medical Oncology, Barts Cancer Centre, St Bartholomew’s Hospital, Barts Heath NHS Trust, London EC1A 7BE, UK
- Department of Medical Oncology, Mount Vernon Cancer Centre, East and North Hertfordshire Trust, London HA6 2RN, UK
- Health Systems and Treatment Optimisation Network, European Cancer Organisation, 1040 Brussels, Belgium
- Oncology Council, Royal Society of Medicine, London W1G 0AE, UK
| | | | - Stergios Boussios
- Department of Medical Oncology, Medway NHS Foundation Trust, Gillingham ME7 5NY, UK; (A.C.); (R.D.R.); (A.G.)
- Kent Medway Medical School, University of Kent, Canterbury CT2 7LX, UK;
- Faculty of Life Sciences & Medicine, School of Cancer & Pharmaceutical Sciences, King’s College London, Strand Campus, London WC2R 2LS, UK
- Faculty of Medicine, Health, and Social Care, Canterbury Christ Church University, Canterbury CT2 7PB, UK
- AELIA Organization, 9th Km Thessaloniki—Thermi, 57001 Thessaloniki, Greece
| |
Collapse
|
37
|
Deniz MS, Guler BY. Assessment of ChatGPT's adherence to ETA-thyroid nodule management guideline over two different time intervals 14 days apart: in binary and multiple-choice queries. Endocrine 2024:10.1007/s12020-024-03750-2. [PMID: 38489133 DOI: 10.1007/s12020-024-03750-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 01/06/2024] [Accepted: 02/15/2024] [Indexed: 03/17/2024]
Abstract
OBJECTIVE Artificial intelligence (AI) has significant potential in healthcare, particularly in providing decision support in specialized domains like thyroid nodule management. This study assesses the effectiveness of ChatGPT-v4, an advanced AI model, in aligning with the European Thyroid Association (ETA) 2023 guidelines. METHODS The study utilized a structured questionnaire comprising 100 questions, divided into true/false and multiple-choice formats, reflecting real-world clinical scenarios in thyroid nodule management. These questions encompassed diagnostic criteria, treatment options, follow-up protocols, and patient counseling. ChatGPT's responses were evaluated for accuracy, consistency, and comprehensiveness using a six-point Likert scale. The assessment was performed initially and repeated after 14 days. RESULTS In the binary queries, the AI model showed an ability to correct some initially incorrect responses, but there was also a noticeable regression in certain responses: 8 of the 11 initially non-compliant responses remained unchanged, 3 non-compliant responses were rectified, and 6 initially compliant answers transitioned to non-compliance after 14 days. In the multiple-choice queries, the AI's performance was more consistent: a majority of responses, 43 (86% of the total), were initially correct and remained correct upon re-assessment, while 4 initially incorrect responses remained unchanged and 3 correct responses shifted to non-compliance over time. CONCLUSION ChatGPT shows promise as a clinical support tool in thyroid nodule management, although its performance varied between binary and multiple-choice questions. CLINICAL TRIAL REGISTRATION N/A.
Collapse
Affiliation(s)
- Muzaffer Serdar Deniz
- Department of Endocrinology, Sincan Education and Research Hospital, Ankara, Turkey.
| | | |
Collapse
|
38
|
Cil G, Dogan K. The efficacy of artificial intelligence in urology: a detailed analysis of kidney stone-related queries. World J Urol 2024; 42:158. [PMID: 38483582 PMCID: PMC10940482 DOI: 10.1007/s00345-024-04847-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/26/2023] [Accepted: 01/24/2024] [Indexed: 03/17/2024] Open
Abstract
PURPOSE The study aimed to assess the efficacy of OpenAI's advanced AI model, ChatGPT, in diagnosing urological conditions, focusing on kidney stones. MATERIALS AND METHODS A set of 90 structured questions, compliant with the EAU Guidelines 2023, was curated by seasoned urologists for this investigation. We evaluated ChatGPT's performance based on the accuracy and completeness of its responses to two types of questions [binary (true/false) and descriptive (multiple-choice)], stratified into difficulty levels: easy, moderate, and complex. Furthermore, we analyzed the model's learning and adaptability capacity by reassessing the initially incorrect responses after a 2-week interval. RESULTS The model demonstrated commendable accuracy, correctly answering 80% of binary questions (n=45) and 93.3% of descriptive questions (n=45). The model's performance showed no significant variation across question difficulty levels (p=0.548 for accuracy; p=0.417 for completeness). Upon reassessment of the 12 initially incorrect responses (9 binary and 3 descriptive) after two weeks, ChatGPT's accuracy showed substantial improvement: the mean accuracy score significantly increased from 1.58 ± 0.51 to 2.83 ± 0.93 (p = 0.004), underlining the model's ability to learn and adapt over time. CONCLUSION These findings highlight the potential of ChatGPT in urological diagnostics, but also underscore areas requiring enhancement, especially in the completeness of responses to complex queries. The study endorses AI's incorporation into healthcare, while advocating for prudence and professional supervision in its application.
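The reported improvement (mean accuracy score rising from 1.58 to 2.83, p = 0.004) is a paired before/after comparison on the same 12 responses. The abstract does not name the test used, so the Wilcoxon signed-rank test below is an assumption, and the scores are hypothetical stand-ins:

```python
# Sketch of a paired before/after comparison, mirroring the two-week
# reassessment of the 12 initially incorrect responses. Scores are
# hypothetical; the choice of the Wilcoxon signed-rank test is an assumption.
from scipy.stats import wilcoxon

before = [1, 2, 1, 2, 2, 1, 2, 1, 2, 1, 2, 2]
after  = [3, 3, 2, 4, 3, 2, 3, 2, 4, 3, 3, 3]

stat, p = wilcoxon(before, after)
print(f"Wilcoxon W={stat}, p={p:.3f}")
```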
Collapse
Affiliation(s)
- Gökhan Cil
- Department of Urology, Bagcilar Training and Research Hospital, University of Health Sciences, Istanbul, Turkey.
| | - Kazim Dogan
- Department of Urology, Faculty of Medicine, Istinye University, Istanbul, Turkey
| |
Collapse
|
39
|
Bains SS, Dubin JA, Hameed D, Sax OC, Douglas S, Mont MA, Nace J, Delanois RE. Use and Application of Large Language Models for Patient Questions Following Total Knee Arthroplasty. J Arthroplasty 2024:S0883-5403(24)00233-X. [PMID: 38490569 DOI: 10.1016/j.arth.2024.03.017] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 12/15/2023] [Revised: 03/06/2024] [Accepted: 03/07/2024] [Indexed: 03/17/2024] Open
Abstract
BACKGROUND A consumer-focused health care model not only allows unprecedented access to information, but equally warrants consideration of the appropriateness of providing accurate patient health information. Nurses play a large role in influencing patient satisfaction following total knee arthroplasty (TKA), but they come at a cost. A specific natural language artificial intelligence (AI) model, ChatGPT (Chat Generative Pre-trained Transformer), has accumulated over 100 million users within months of launching. As such, we aimed to compare: (1) orthopaedic surgeons' evaluation of the appropriateness of the answers to the most frequently asked patient questions after TKA; and (2) patients' comfort level in having their postoperative questions answered using answers provided by arthroplasty-trained nurses and ChatGPT. METHODS We prospectively created 60 questions based on the most commonly asked patient questions following TKA. There were 3 fellowship-trained surgeons who assessed the answers provided by arthroplasty-trained nurses and ChatGPT-4 to each of the questions. The surgeons graded each set of responses based on clinical judgment as: (1) "appropriate"; (2) "inappropriate," if the response contained inappropriate information; or (3) "unreliable," if the responses provided inconsistent content. Patients' comfort level and trust in AI were assessed using Research Electronic Data Capture (REDCap) hosted at our local hospital. RESULTS The surgeons graded 44 out of 60 (73.3%) responses from the arthroplasty-trained nurses and 44 out of 60 (73.3%) from ChatGPT as "appropriate." Among the nurses' responses, 4 were graded "inappropriate" and 1 "unreliable"; among ChatGPT's responses, 5 were graded "inappropriate" and none "unreliable." There were 136 patients (53.8%) who were more comfortable with the answers provided by ChatGPT, compared to 86 patients (34.0%) who preferred the answers from arthroplasty-trained nurses. Of the 253 patients, 233 (92.1%) were uncertain whether they would trust AI to answer their postoperative questions. There were 127 patients (50.2%) who answered that if they knew an answer had been provided by ChatGPT, their comfort level in trusting it would change. CONCLUSIONS One potential use of ChatGPT can be found in providing appropriate answers to patient questions after TKA. At our institution, cost expenditures can potentially be minimized while maintaining patient satisfaction. Inevitably, successful implementation is dependent on the ability to provide information that is credible and in accordance with the objectives of both physicians and patients. LEVEL OF EVIDENCE III.
Collapse
Affiliation(s)
- Sandeep S Bains
- Rubin Institute for Advanced Orthopedics, LifeBridge Health, Sinai Hospital of Baltimore, Baltimore, Maryland
| | - Jeremy A Dubin
- Rubin Institute for Advanced Orthopedics, LifeBridge Health, Sinai Hospital of Baltimore, Baltimore, Maryland
| | - Daniel Hameed
- Rubin Institute for Advanced Orthopedics, LifeBridge Health, Sinai Hospital of Baltimore, Baltimore, Maryland
| | - Oliver C Sax
- Rubin Institute for Advanced Orthopedics, LifeBridge Health, Sinai Hospital of Baltimore, Baltimore, Maryland
| | - Scott Douglas
- Rubin Institute for Advanced Orthopedics, LifeBridge Health, Sinai Hospital of Baltimore, Baltimore, Maryland
| | - Michael A Mont
- Rubin Institute for Advanced Orthopedics, LifeBridge Health, Sinai Hospital of Baltimore, Baltimore, Maryland
| | - James Nace
- Rubin Institute for Advanced Orthopedics, LifeBridge Health, Sinai Hospital of Baltimore, Baltimore, Maryland
| | - Ronald E Delanois
- Rubin Institute for Advanced Orthopedics, LifeBridge Health, Sinai Hospital of Baltimore, Baltimore, Maryland
| |
Collapse
|
40
|
Park YJ, Pillai A, Deng J, Guo E, Gupta M, Paget M, Naugler C. Assessing the research landscape and clinical utility of large language models: a scoping review. BMC Med Inform Decis Mak 2024; 24:72. [PMID: 38475802 DOI: 10.1186/s12911-024-02459-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/21/2023] [Accepted: 02/12/2024] [Indexed: 03/14/2024] Open
Abstract
IMPORTANCE Large language models (LLMs) like OpenAI's ChatGPT are powerful generative systems that rapidly synthesize natural language responses. Research on LLMs has revealed their potential and pitfalls, especially in clinical settings. However, the evolving landscape of LLM research in medicine has left several gaps regarding their evaluation, application, and evidence base. OBJECTIVE This scoping review aims to (1) summarize current research evidence on the accuracy and efficacy of LLMs in medical applications, (2) discuss the ethical, legal, logistical, and socioeconomic implications of LLM use in clinical settings, (3) explore barriers and facilitators to LLM implementation in healthcare, (4) propose a standardized evaluation framework for assessing LLMs' clinical utility, and (5) identify evidence gaps and propose future research directions for LLMs in clinical applications. EVIDENCE REVIEW We screened 4,036 records from MEDLINE, EMBASE, CINAHL, medRxiv, bioRxiv, and arXiv from January 2023 (inception of the search) to June 26, 2023 for English-language papers and analyzed findings from 55 worldwide studies. Quality of evidence was reported based on the Oxford Centre for Evidence-based Medicine recommendations. FINDINGS Our results demonstrate that LLMs show promise in compiling patient notes, assisting patients in navigating the healthcare system, and to some extent, supporting clinical decision-making when combined with human oversight. However, their utilization is limited by biases in training data that may harm patients, the generation of inaccurate but convincing information, and ethical, legal, socioeconomic, and privacy concerns. We also identified a lack of standardized methods for evaluating LLMs' effectiveness and feasibility. CONCLUSIONS AND RELEVANCE This review thus highlights potential future directions and questions to address these limitations and to further explore LLMs' potential in enhancing healthcare delivery.
Collapse
Affiliation(s)
- Ye-Jean Park
- Temerty Faculty of Medicine, University of Toronto, 1 King's College Cir, M5S 1A8, Toronto, ON, Canada.
| | - Abhinav Pillai
- Cumming School of Medicine, University of Calgary, 3330 Hospital Dr NW, T2N 4N1, Calgary, AB, Canada
| | - Jiawen Deng
- Temerty Faculty of Medicine, University of Toronto, 1 King's College Cir, M5S 1A8, Toronto, ON, Canada
| | - Eddie Guo
- Cumming School of Medicine, University of Calgary, 3330 Hospital Dr NW, T2N 4N1, Calgary, AB, Canada
| | - Mehul Gupta
- Cumming School of Medicine, University of Calgary, 3330 Hospital Dr NW, T2N 4N1, Calgary, AB, Canada
| | - Mike Paget
- Cumming School of Medicine, University of Calgary, 3330 Hospital Dr NW, T2N 4N1, Calgary, AB, Canada
| | - Christopher Naugler
- Cumming School of Medicine, University of Calgary, 3330 Hospital Dr NW, T2N 4N1, Calgary, AB, Canada
| |
Collapse
|
41
|
Lee Y, Kim SY. Potential applications of ChatGPT in obstetrics and gynecology in Korea: a review article. Obstet Gynecol Sci 2024; 67:153-159. [PMID: 38247132 PMCID: PMC10948210 DOI: 10.5468/ogs.23231] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/21/2023] [Revised: 11/08/2023] [Accepted: 11/29/2023] [Indexed: 01/23/2024] Open
Abstract
The use of chatbot technology, particularly chat generative pre-trained transformer (ChatGPT) with an impressive 175 billion parameters, has garnered significant attention across various domains, including Obstetrics and Gynecology (OBGYN). This comprehensive review delves into the transformative potential of chatbots with a special focus on ChatGPT as a leading artificial intelligence (AI) technology. Moreover, ChatGPT harnesses the power of deep learning algorithms to generate responses that closely mimic human language, opening up myriad applications in medicine, research, and education. In the field of medicine, ChatGPT plays a pivotal role in diagnosis, treatment, and personalized patient education. Notably, the technology has demonstrated remarkable capabilities, surpassing human performance in OBGYN examinations, and delivering highly accurate diagnoses. However, challenges remain, including the need to verify the accuracy of the information and address the ethical considerations and limitations. In the wide scope of chatbot technology, AI systems play a vital role in healthcare processes, including documentation, diagnosis, research, and education. Although promising, the limitations and occasional inaccuracies require validation by healthcare professionals. This review also examined global chatbot adoption in healthcare, emphasizing the need for user awareness to ensure patient safety. Chatbot technology holds great promise in OBGYN and medicine, offering innovative solutions while necessitating responsible integration to ensure patient care and safety.
Collapse
Affiliation(s)
- YooKyung Lee
- Division of Maternal Fetal Medicine, Department of Obstetrics and Gynecology, MizMedi Hospital, Seoul, Korea
| | - So Yun Kim
- Division of Maternal Fetal Medicine, Department of Obstetrics and Gynecology, MizMedi Hospital, Seoul, Korea
| |
Collapse
|
42
|
Puladi B, Gsaxner C, Kleesiek J, Hölzle F, Röhrig R, Egger J. Response to the comment on "The impact and opportunities of large language models like ChatGPT in oral and maxillofacial surgery: a narrative review". Int J Oral Maxillofac Surg 2024:S0901-5027(24)00012-2. [PMID: 38310049 DOI: 10.1016/j.ijom.2023.12.010] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/07/2023] [Revised: 11/29/2023] [Accepted: 12/06/2023] [Indexed: 02/05/2024]
Affiliation(s)
- B Puladi
- Department of Oral and Maxillofacial Surgery, University Hospital RWTH Aachen, Aachen, Germany; Institute of Medical Informatics, University Hospital RWTH Aachen, Aachen, Germany
| | - C Gsaxner
- Department of Oral and Maxillofacial Surgery, University Hospital RWTH Aachen, Aachen, Germany; Institute of Medical Informatics, University Hospital RWTH Aachen, Aachen, Germany; Institute of Computer Graphics and Vision, Graz University of Technology, Graz, Austria; Department of Oral and Maxillofacial Surgery, Medical University of Graz, Graz, Austria
| | - J Kleesiek
- Institute for AI in Medicine (IKIM), University Hospital Essen (AöR), Essen, Germany
| | - F Hölzle
- Department of Oral and Maxillofacial Surgery, University Hospital RWTH Aachen, Aachen, Germany
| | - R Röhrig
- Institute of Medical Informatics, University Hospital RWTH Aachen, Aachen, Germany
| | - J Egger
- Institute of Computer Graphics and Vision, Graz University of Technology, Graz, Austria; Institute for AI in Medicine (IKIM), University Hospital Essen (AöR), Essen, Germany; Center for Virtual and Extended Reality in Medicine (ZvRM), Essen University Hospital (AöR), Essen, Germany.
| |
Collapse
|
43
|
Kavadella A, Dias da Silva MA, Kaklamanos EG, Stamatopoulos V, Giannakopoulos K. Evaluation of ChatGPT's Real-Life Implementation in Undergraduate Dental Education: Mixed Methods Study. JMIR Med Educ 2024; 10:e51344. [PMID: 38111256 PMCID: PMC10867750 DOI: 10.2196/51344] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 07/28/2023] [Revised: 10/28/2023] [Accepted: 12/11/2023] [Indexed: 12/20/2023]
Abstract
BACKGROUND The recent artificial intelligence tool ChatGPT seems to offer a range of benefits in academic education while also raising concerns. Relevant literature encompasses issues of plagiarism and academic dishonesty, as well as pedagogy and educational affordances; yet, no real-life implementation of ChatGPT in the educational process has been reported to our knowledge so far. OBJECTIVE This mixed methods study aimed to evaluate the implementation of ChatGPT in the educational process, both quantitatively and qualitatively. METHODS In March 2023, a total of 77 second-year dental students of the European University Cyprus were divided into 2 groups and asked to compose a learning assignment on "Radiation Biology and Radiation Protection in the Dental Office," working collaboratively in small subgroups, as part of the educational semester program of the Dentomaxillofacial Radiology module. Careful planning ensured a seamless integration of ChatGPT, addressing potential challenges. One group searched the internet for scientific resources to perform the task and the other group used ChatGPT for this purpose. Both groups developed a PowerPoint (Microsoft Corp) presentation based on their research and presented it in class. The ChatGPT group students additionally registered all interactions with the language model during the prompting process and evaluated the final outcome; they also answered an open-ended evaluation questionnaire, including questions on their learning experience. Finally, all students undertook a knowledge examination on the topic, and the grades between the 2 groups were compared statistically, whereas the free-text comments of the questionnaires were thematically analyzed. RESULTS Out of the 77 students, 39 were assigned to the ChatGPT group and 38 to the literature research group. Seventy students undertook the multiple choice question knowledge examination, and examination grades ranged from 5 to 10 on the 0-10 grading scale. The Mann-Whitney U test showed that students of the ChatGPT group performed significantly better (P=.045) than students of the literature research group. The evaluation questionnaires revealed the benefits (human-like interface, immediate response, and wide knowledge base), the limitations (need for rephrasing the prompts to get a relevant answer, general content, false citations, and incapability to provide images or videos), and the prospects (in education, clinical practice, continuing education, and research) of ChatGPT. CONCLUSIONS Students using ChatGPT for their learning assignments performed significantly better in the knowledge examination than their fellow students who used the literature research methodology. Students adapted quickly to the technological environment of the language model, recognized its opportunities and limitations, and used it creatively and efficiently. Implications for practice: the study underscores the adaptability of students to technological innovations including ChatGPT and its potential to enhance educational outcomes. Educators should consider integrating ChatGPT into curriculum design; awareness programs are warranted to educate both students and educators about the limitations of ChatGPT, encouraging critical engagement and responsible use.
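The grade comparison above is a standard two-sample Mann-Whitney U test, which the authors report directly. A minimal sketch with hypothetical grades on the study's 0-10 scale:

```python
# Minimal sketch of the Mann-Whitney U comparison of examination grades
# between the ChatGPT group and the literature-research group. Grades are
# hypothetical placeholders on the 0-10 scale, not the study's data.
from scipy.stats import mannwhitneyu

chatgpt_group    = [8, 9, 7, 10, 8, 9, 6, 8, 9, 7]
literature_group = [7, 6, 8, 7, 5, 7, 8, 6, 7, 6]

u_stat, p = mannwhitneyu(chatgpt_group, literature_group, alternative="two-sided")
print(f"U={u_stat}, p={p:.3f}")
```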
Collapse
Affiliation(s)
- Argyro Kavadella
- School of Dentistry, European University Cyprus, Nicosia, Cyprus
| | - Marco Antonio Dias da Silva
- Research Group of Teleducation and Teledentistry, Federal University of Campina Grande, Campina Grande, Brazil
| | - Eleftherios G Kaklamanos
- School of Dentistry, European University Cyprus, Nicosia, Cyprus
- School of Dentistry, Aristotle University of Thessaloniki, Thessaloniki, Greece
- Mohammed Bin Rashid University of Medicine and Health Sciences, Dubai, United Arab Emirates
| | - Vasileios Stamatopoulos
- Information Management Systems Institute, ATHENA Research and Innovation Center, Athens, Greece
| | | |
Collapse
|
44
|
Odabashian R, Bastin D, Jones G, Manzoor M, Tangestaniapour S, Assad M, Lakhani S, Odabashian M, McGee S. Assessment of ChatGPT-3.5's Knowledge in Oncology: Comparative Study with ASCO-SEP Benchmarks. JMIR AI 2024; 3:e50442. [PMID: 38875575 PMCID: PMC11041475 DOI: 10.2196/50442] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 07/02/2023] [Revised: 10/05/2023] [Accepted: 11/19/2023] [Indexed: 06/16/2024]
Abstract
BACKGROUND ChatGPT (OpenAI) is a state-of-the-art large language model that uses artificial intelligence (AI) to address questions across diverse topics. The American Society of Clinical Oncology Self-Evaluation Program (ASCO-SEP) created a comprehensive educational program to help physicians keep up to date with the many rapid advances in the field. The question bank consists of multiple-choice questions addressing the many facets of cancer care, including diagnosis, treatment, and supportive care. As ChatGPT applications rapidly expand, it becomes vital to ascertain whether the knowledge of ChatGPT-3.5 matches the established standards that oncologists are recommended to follow. OBJECTIVE This study aims to evaluate whether ChatGPT-3.5's knowledge aligns with the established benchmarks that oncologists are expected to adhere to. This will furnish us with a deeper understanding of the potential applications of this tool as a support for clinical decision-making. METHODS We conducted a systematic assessment of the performance of ChatGPT-3.5 on the ASCO-SEP, the leading educational and assessment tool for medical oncologists in training and practice. Over 1000 multiple-choice questions covering the spectrum of cancer care were extracted. Questions were categorized by cancer type or discipline, with subcategorization as treatment, diagnosis, or other. Answers were scored as correct if ChatGPT-3.5 selected the answer as defined by ASCO-SEP. RESULTS Overall, ChatGPT-3.5 achieved a score of 56.1% (583/1040) for the correct answers provided. The program demonstrated varying levels of accuracy across cancer types or disciplines. The highest accuracy was observed in questions related to developmental therapeutics (8/10; 80% correct), while the lowest accuracy was observed in questions related to gastrointestinal cancer (102/209; 48.8% correct). There was no significant difference in the program's performance across the predefined subcategories of diagnosis, treatment, and other (P=.16). CONCLUSIONS This study evaluated ChatGPT-3.5's oncology knowledge using the ASCO-SEP, aiming to address uncertainties regarding AI tools like ChatGPT in clinical decision-making. Our findings suggest that while ChatGPT-3.5 offers a hopeful outlook for AI in oncology, its present performance in ASCO-SEP tests necessitates further refinement to reach the requisite competency levels. Future assessments could explore ChatGPT's clinical decision support capabilities with real-world clinical scenarios, its ease of integration into medical workflows, and its potential to foster interdisciplinary collaboration and patient engagement in health care settings.
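Comparing accuracy across the diagnosis/treatment/other subcategories, as in the P=.16 result above, amounts to a contingency-table test. The abstract does not name the test, so the chi-square test of independence below is an assumption, and the counts are invented for illustration:

```python
# Sketch: per-category accuracy and a chi-square test of independence across
# the diagnosis / treatment / other subcategories. Counts are illustrative,
# not the study's data, and the choice of test is an assumption.
from scipy.stats import chi2_contingency

#            diagnosis  treatment  other
correct   = [      120,       310,   153]
incorrect = [       95,       255,   107]

chi2, p, dof, _ = chi2_contingency([correct, incorrect])
print(f"chi2={chi2:.2f}, dof={dof}, p={p:.3f}")

for name, c, i in zip(["diagnosis", "treatment", "other"], correct, incorrect):
    print(f"{name}: {c / (c + i):.1%} correct")
```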
Collapse
Affiliation(s)
- Roupen Odabashian
- Department of Oncology, Barbara Ann Karmanos Cancer Institute, Wayne State University, Detroit, MI, United States
| | - Donald Bastin
- Department of Medicine, Division of Internal Medicine, The Ottawa Hospital and the University of Ottawa, Ottawa, ON, Canada
| | - Georden Jones
- Mary A Rackham Institute, University of Michigan, Ann Arbor, MI, United States
| | | | | | - Malke Assad
- Department of Plastic Surgery, University of Pittsburgh Medical Center, Pittsburgh, PA, United States
| | - Sunita Lakhani
- Department of Medicine, Division of Internal Medicine, Jefferson Abington Hospital, Philadelphia, PA, United States
| | - Maritsa Odabashian
- Mary A Rackham Institute, University of Michigan, Ann Arbor, MI, United States
- The Ottawa Hospital Research Institute, Ottawa, ON, Canada
| | - Sharon McGee
- Department of Medicine, Division of Medical Oncology, The Ottawa Hospital and the University of Ottawa, Ottawa, ON, Canada
- Cancer Therapeutics Program, Ottawa Hospital Research Institute, Ottawa, ON, Canada
| |
Collapse
|
45
|
Zheng Y, Sun X, Feng B, Kang K, Yang Y, Zhao A, Wu Y. Rare and complex diseases in focus: ChatGPT's role in improving diagnosis and treatment. Front Artif Intell 2024; 7:1338433. [PMID: 38283995 PMCID: PMC10808657 DOI: 10.3389/frai.2024.1338433] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/14/2023] [Accepted: 01/02/2024] [Indexed: 01/30/2024] Open
Abstract
Rare and complex diseases pose significant challenges to both patients and healthcare providers. These conditions often present with atypical symptoms, making diagnosis and treatment a formidable task. In recent years, artificial intelligence and natural language processing technologies have shown great promise in assisting medical professionals in diagnosing and managing such conditions. This paper explores the role of ChatGPT, an advanced artificial intelligence model, in improving the diagnosis and treatment of rare and complex diseases. By analyzing its potential applications, limitations, and ethical considerations, we demonstrate how ChatGPT can contribute to better patient outcomes and enhance the healthcare system's overall effectiveness.
Collapse
Affiliation(s)
- Yue Zheng
- Cancer Center, West China Hospital, Sichuan University, Chengdu, Sichuan, China
| | - Xu Sun
- Department of Hematology, West China Hospital, Sichuan University, Chengdu, Sichuan, China
| | - Baijie Feng
- West China School of Medicine, Sichuan University, Chengdu, Sichuan, China
| | - Kai Kang
- Cancer Center, West China Hospital, Sichuan University, Chengdu, Sichuan, China
| | - Yuqi Yang
- West China School of Medicine, Sichuan University, Chengdu, Sichuan, China
| | - Ailin Zhao
- Department of Hematology, West China Hospital, Sichuan University, Chengdu, Sichuan, China
| | - Yijun Wu
- Cancer Center, West China Hospital, Sichuan University, Chengdu, Sichuan, China
| |
Collapse
|
46
|
Padovan M, Cosci B, Petillo A, Nerli G, Porciatti F, Scarinci S, Carlucci F, Dell’Amico L, Meliani N, Necciari G, Lucisano VC, Marino R, Foddis R, Palla A. ChatGPT in Occupational Medicine: A Comparative Study with Human Experts. Bioengineering (Basel) 2024; 11:57. [PMID: 38247934 PMCID: PMC10813435 DOI: 10.3390/bioengineering11010057] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/07/2023] [Revised: 01/01/2024] [Accepted: 01/04/2024] [Indexed: 01/23/2024] Open
Abstract
The objective of this study is to evaluate ChatGPT's accuracy and reliability in answering complex medical questions related to occupational health, and to explore the implications and limitations of AI in occupational health medicine. The study also provides recommendations for future research in this area and informs decision-makers about AI's impact on healthcare. A group of physicians was enlisted to create a dataset of questions and answers on Italian occupational medicine legislation. The physicians were divided into two teams, and each team member was assigned a different subject area. ChatGPT was used to generate answers for each question, with and without legislative context. The two teams then blindly evaluated the human- and AI-generated answers, each group reviewing the other group's work. Occupational physicians outperformed ChatGPT in generating accurate answers on a 5-point Likert scale, while the answers provided by ChatGPT with access to legislative texts were comparable to those of professional doctors. Still, we found that users tend to prefer answers generated by humans, indicating that while ChatGPT is useful, users still value the opinions of occupational medicine professionals.
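The study's two prompting conditions, with and without the legislative text, map naturally onto a system-message context in a chat-completion call. A minimal sketch using the OpenAI Python SDK (v1.x) follows; the model name, prompt wording, and helper function are illustrative assumptions, not the study's actual setup:

```python
# Sketch of the two prompting conditions: the same question asked with and
# without legislative context. The model name, prompts, and helper function
# are illustrative assumptions, not the study's actual configuration.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(question: str, context: str | None = None) -> str:
    messages = []
    if context is not None:
        messages.append({"role": "system",
                         "content": f"Answer using only this legislative text:\n{context}"})
    messages.append({"role": "user", "content": question})
    resp = client.chat.completions.create(model="gpt-4", messages=messages)
    return resp.choices[0].message.content

question = "What health surveillance must an employer arrange?"  # hypothetical
answer_without_context = ask(question)
answer_with_context = ask(question, context="<relevant statute text here>")
```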
Collapse
Affiliation(s)
- Martina Padovan
- Department of Translational Research and New Technologies in Medicine and Surgery, University of Pisa, 56126 Pisa, Italy; (M.P.); (B.C.); (A.P.); (G.N.); (F.P.); (S.S.); (F.C.); (L.D.); (N.M.); (G.N.); (R.M.)
| | - Bianca Cosci
- Department of Translational Research and New Technologies in Medicine and Surgery, University of Pisa, 56126 Pisa, Italy; (M.P.); (B.C.); (A.P.); (G.N.); (F.P.); (S.S.); (F.C.); (L.D.); (N.M.); (G.N.); (R.M.)
| | - Armando Petillo
- Department of Translational Research and New Technologies in Medicine and Surgery, University of Pisa, 56126 Pisa, Italy; (M.P.); (B.C.); (A.P.); (G.N.); (F.P.); (S.S.); (F.C.); (L.D.); (N.M.); (G.N.); (R.M.)
| | - Gianluca Nerli
- Department of Translational Research and New Technologies in Medicine and Surgery, University of Pisa, 56126 Pisa, Italy; (M.P.); (B.C.); (A.P.); (G.N.); (F.P.); (S.S.); (F.C.); (L.D.); (N.M.); (G.N.); (R.M.)
| | - Francesco Porciatti
- Department of Translational Research and New Technologies in Medicine and Surgery, University of Pisa, 56126 Pisa, Italy; (M.P.); (B.C.); (A.P.); (G.N.); (F.P.); (S.S.); (F.C.); (L.D.); (N.M.); (G.N.); (R.M.)
| | - Sergio Scarinci
- Department of Translational Research and New Technologies in Medicine and Surgery, University of Pisa, 56126 Pisa, Italy; (M.P.); (B.C.); (A.P.); (G.N.); (F.P.); (S.S.); (F.C.); (L.D.); (N.M.); (G.N.); (R.M.)
| | - Francesco Carlucci
- Department of Translational Research and New Technologies in Medicine and Surgery, University of Pisa, 56126 Pisa, Italy; (M.P.); (B.C.); (A.P.); (G.N.); (F.P.); (S.S.); (F.C.); (L.D.); (N.M.); (G.N.); (R.M.)
| | - Letizia Dell’Amico
- Department of Translational Research and New Technologies in Medicine and Surgery, University of Pisa, 56126 Pisa, Italy; (M.P.); (B.C.); (A.P.); (G.N.); (F.P.); (S.S.); (F.C.); (L.D.); (N.M.); (G.N.); (R.M.)
| | - Niccolò Meliani
- Department of Translational Research and New Technologies in Medicine and Surgery, University of Pisa, 56126 Pisa, Italy; (M.P.); (B.C.); (A.P.); (G.N.); (F.P.); (S.S.); (F.C.); (L.D.); (N.M.); (G.N.); (R.M.)
| | - Gabriele Necciari
- Department of Translational Research and New Technologies in Medicine and Surgery, University of Pisa, 56126 Pisa, Italy; (M.P.); (B.C.); (A.P.); (G.N.); (F.P.); (S.S.); (F.C.); (L.D.); (N.M.); (G.N.); (R.M.)
| | - Vincenzo Carmelo Lucisano
- Department of Translational Research and New Technologies in Medicine and Surgery, University of Pisa, 56126 Pisa, Italy; (M.P.); (B.C.); (A.P.); (G.N.); (F.P.); (S.S.); (F.C.); (L.D.); (N.M.); (G.N.); (R.M.)
| | - Riccardo Marino
- Department of Translational Research and New Technologies in Medicine and Surgery, University of Pisa, 56126 Pisa, Italy; (M.P.); (B.C.); (A.P.); (G.N.); (F.P.); (S.S.); (F.C.); (L.D.); (N.M.); (G.N.); (R.M.)
| | - Rudy Foddis
- Department of Translational Research and New Technologies in Medicine and Surgery, University of Pisa, 56126 Pisa, Italy; (M.P.); (B.C.); (A.P.); (G.N.); (F.P.); (S.S.); (F.C.); (L.D.); (N.M.); (G.N.); (R.M.)
| | | |
Collapse
|
47
|
Younis HA, Eisa TAE, Nasser M, Sahib TM, Noor AA, Alyasiri OM, Salisu S, Hayder IM, Younis HA. A Systematic Review and Meta-Analysis of Artificial Intelligence Tools in Medicine and Healthcare: Applications, Considerations, Limitations, Motivation and Challenges. Diagnostics (Basel) 2024; 14:109. [PMID: 38201418 PMCID: PMC10802884 DOI: 10.3390/diagnostics14010109] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/29/2023] [Revised: 12/02/2023] [Accepted: 12/04/2023] [Indexed: 01/12/2024] Open
Abstract
Artificial intelligence (AI) has emerged as a transformative force in various sectors, including medicine and healthcare. Large language models like ChatGPT showcase AI's potential by generating human-like text through prompts. ChatGPT's adaptability holds promise for reshaping medical practices, improving patient care, and enhancing interactions among healthcare professionals, patients, and data. In pandemic management, ChatGPT rapidly disseminates vital information; it also serves as a virtual assistant in surgical consultations, supports dental practice, simplifies medical education, and aids in disease diagnosis. A systematic literature review using the PRISMA approach explored AI's transformative potential in healthcare, highlighting ChatGPT's versatile applications, limitations, motivations, and challenges. A total of 82 papers were categorised into eight major areas: G1, treatment and medicine; G2, buildings and equipment; G3, parts of the human body and areas of disease; G4, patients; G5, citizens; G6, cellular imaging, radiology, pulse and medical images; G7, doctors and nurses; and G8, tools, devices and administration. Balancing AI's role with human judgment remains a challenge. In conclusion, ChatGPT's diverse medical applications demonstrate its potential for innovation, serving as a valuable resource and practical guide for students, academics, and researchers in medicine and healthcare.
Collapse
Affiliation(s)
- Hussain A. Younis
- College of Education for Women, University of Basrah, Basrah 61004, Iraq
| | | | - Maged Nasser
- Computer & Information Sciences Department, Universiti Teknologi PETRONAS, Seri Iskandar 32610, Malaysia;
| | - Thaeer Mueen Sahib
- Kufa Technical Institute, Al-Furat Al-Awsat Technical University, Kufa 54001, Iraq;
| | - Ameen A. Noor
- Computer Science Department, College of Education, University of Almustansirya, Baghdad 10045, Iraq;
| | | | - Sani Salisu
- Department of Information Technology, Federal University Dutse, Dutse 720101, Nigeria;
| | - Israa M. Hayder
- Qurna Technique Institute, Southern Technical University, Basrah 61016, Iraq;
| | - Hameed AbdulKareem Younis
- Department of Cybersecurity, College of Computer Science and Information Technology, University of Basrah, Basrah 61016, Iraq;
| |
Collapse
|
48
|
Morales-Ramirez P, Mishek H, Dasgupta A. The Genie Is Out of the Bottle: What ChatGPT Can and Cannot Do for Medical Professionals. Obstet Gynecol 2024; 143:e1-e6. [PMID: 37944140 DOI: 10.1097/aog.0000000000005446] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/24/2023] [Accepted: 10/12/2023] [Indexed: 11/12/2023]
Abstract
ChatGPT is a cutting-edge artificial intelligence technology that was released for public use in November 2022. Its rapid adoption has raised questions about capabilities, limitations, and risks. This article presents an overview of ChatGPT, and it highlights the current state of this technology for the medical field. The article seeks to provide a balanced perspective on what the model can and cannot do in three specific domains: clinical practice, research, and medical education. It also provides suggestions on how to optimize the use of this tool.
Collapse
|
49
|
Mannstadt I, Mehta B. Large language models and the future of rheumatology: assessing impact and emerging opportunities. Curr Opin Rheumatol 2024; 36:46-51. [PMID: 37729050 DOI: 10.1097/bor.0000000000000981] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 09/22/2023]
Abstract
PURPOSE OF REVIEW Large language models (LLMs) have grown rapidly in size and capabilities as more training data and compute power have become available. Since the release of ChatGPT in late 2022, there has been growing interest and exploration around potential applications of LLM technology. Numerous examples and pilot studies demonstrating the capabilities of these tools have emerged across several domains. For rheumatology professionals and patients, LLMs have the potential to transform current practices in medicine. RECENT FINDINGS Recent studies have begun exploring capabilities of LLMs that can assist rheumatologists in clinical practice, research, and medical education, though applications are still emerging. In clinical settings, LLMs have shown promise in assisting healthcare professionals, enabling more personalized medicine and generating routine documentation like notes and letters. Challenges remain around integrating LLMs into clinical workflows, ensuring the accuracy of their outputs, and preserving patient data confidentiality. In research, early experiments demonstrate LLMs can offer analysis of datasets, with quality control as a critical piece. Lastly, LLMs could supplement medical education by providing personalized learning experiences and integrating into established curricula. SUMMARY As these powerful tools continue evolving at a rapid pace, rheumatology professionals should stay informed on how they may impact the field.
Collapse
Affiliation(s)
| | - Bella Mehta
- Weill Cornell Medicine
- Hospital for Special Surgery, New York, New York, USA
| |
Collapse
|
50
|
Giannakopoulos K, Kavadella A, Aaqel Salim A, Stamatopoulos V, Kaklamanos EG. Evaluation of the Performance of Generative AI Large Language Models ChatGPT, Google Bard, and Microsoft Bing Chat in Supporting Evidence-Based Dentistry: Comparative Mixed Methods Study. J Med Internet Res 2023; 25:e51580. [PMID: 38009003 PMCID: PMC10784979 DOI: 10.2196/51580] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/04/2023] [Revised: 10/15/2023] [Accepted: 11/20/2023] [Indexed: 11/28/2023] Open
Abstract
BACKGROUND The increasing application of generative artificial intelligence large language models (LLMs) in various fields, including dentistry, raises questions about their accuracy. OBJECTIVE This study aims to comparatively evaluate the answers provided by 4 LLMs, namely Bard (Google LLC), ChatGPT-3.5 and ChatGPT-4 (OpenAI), and Bing Chat (Microsoft Corp), to clinically relevant questions from the field of dentistry. METHODS The LLMs were queried with 20 open-type, clinical dentistry-related questions from different disciplines, developed by the respective faculty of the School of Dentistry, European University Cyprus. The LLMs' answers were graded 0 (minimum) to 10 (maximum) points against strong, traditionally collected scientific evidence, such as guidelines and consensus statements, using a rubric, as if they were examination questions posed to students, by 2 experienced faculty members. The scores were statistically compared to identify the best-performing model using the Friedman and Wilcoxon tests. Moreover, the evaluators were asked to provide a qualitative evaluation of the comprehensiveness, scientific accuracy, clarity, and relevance of the LLMs' answers. RESULTS Overall, no statistically significant difference was detected between the scores given by the 2 evaluators; therefore, an average score was computed for every LLM. Although ChatGPT-4 statistically outperformed ChatGPT-3.5 (P=.008), Bing Chat (P=.049), and Bard (P=.045), all models occasionally exhibited inaccuracies, generality, outdated content, and a lack of source references. The evaluators noted instances where the LLMs delivered irrelevant information, vague answers, or information that was not fully accurate. CONCLUSIONS This study demonstrates that although LLMs hold promising potential as an aid in the implementation of evidence-based dentistry, their current limitations can lead to potentially harmful health care decisions if not used judiciously. Therefore, these tools should not replace the dentist's critical thinking and in-depth understanding of the subject matter. Further research, clinical validation, and model improvements are necessary for these tools to be fully integrated into dental practice. Dental practitioners must be aware of the limitations of LLMs, as their imprudent use could potentially impact patient care. Regulatory measures should be established to oversee the use of these evolving technologies.
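The scoring comparison above, the same questions graded for all four models, is a repeated-measures design, which is why the authors use a Friedman test followed by pairwise Wilcoxon tests. A minimal sketch with hypothetical rubric scores:

```python
# Sketch: Friedman test across four models' rubric scores on the same
# questions, followed by one pairwise Wilcoxon comparison. Scores are
# hypothetical placeholders on the study's 0-10 rubric, not its data.
from scipy.stats import friedmanchisquare, wilcoxon

bard      = [6, 5, 7, 6, 5, 6, 7, 5, 6, 6]
chatgpt35 = [7, 6, 7, 7, 6, 6, 8, 6, 7, 7]
chatgpt4  = [8, 8, 9, 8, 7, 8, 9, 7, 8, 8]
bing_chat = [7, 6, 8, 7, 6, 7, 7, 6, 7, 7]

stat, p = friedmanchisquare(bard, chatgpt35, chatgpt4, bing_chat)
print(f"Friedman chi2={stat:.2f}, p={p:.4f}")

# One pairwise follow-up; in practice all pairs, with multiplicity correction.
w, pw = wilcoxon(chatgpt4, chatgpt35)
print(f"ChatGPT-4 vs ChatGPT-3.5: W={w}, p={pw:.4f}")
```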
Collapse
Affiliation(s)
| | - Argyro Kavadella
- School of Dentistry, European University Cyprus, Nicosia, Cyprus
| | - Anas Aaqel Salim
- School of Dentistry, European University Cyprus, Nicosia, Cyprus
| | - Vassilis Stamatopoulos
- Information Management Systems Institute, ATHENA Research and Innovation Center, Athens, Greece
| | - Eleftherios G Kaklamanos
- School of Dentistry, European University Cyprus, Nicosia, Cyprus
- School of Dentistry, Aristotle University of Thessaloniki, Thessaloniki, Greece
- Mohammed Bin Rashid University of Medicine and Health Sciences, Dubai, United Arab Emirates
| |
Collapse
|