1. Liberatore G, Kim A, Brenner J, Milanaik R. Artificial intelligence impacts in education and pediatric mental health. Curr Opin Pediatr 2025; 37:296-302. PMID: 40105197. DOI: 10.1097/mop.0000000000001453.

Abstract
PURPOSE OF REVIEW Increased accessibility of artificial intelligence to children has raised concerns regarding its effects on education and student mental health. Pediatricians should continue to be informed about the effects of artificial intelligence in their patients' daily lives, as it becomes increasingly present. RECENT FINDINGS The use of artificial intelligence to create personalized study material illustrates a benefit of incorporating this technology into education. However, an overreliance on artificial intelligence could decrease students' problem-solving skills and increase plagiarism. Novel uses of artificial intelligence have also raised concerns regarding mental health. Content produced with deepfake technology, which utilizes artificial intelligence to create images, videos, and/or audio that appear real but are fabricated, can be viewed online by children, which could have negative mental health implications. SUMMARY Although artificial intelligence has the potential to revolutionize education at all levels, its use as an enhancement to, not a replacement for, current educational strategies is imperative. Both parents and students need to understand the limitations of artificial intelligence in education and simultaneously prioritize developing the cognitive skills that education is meant to strengthen. Pediatricians and parents should also be aware of the potentially dangerous material generated by artificial intelligence that can negatively impact children's mental health.
Affiliation(s)
- Grace Liberatore
- Cohen Children's Medical Center, Developmental and Behavioral Pediatrics, Northwell, New Hyde Park, New York, USA

2. Liberatore G, Brenner J, Franco J, Milanaik R. The potential of artificial intelligence to transform medicine. Curr Opin Pediatr 2025; 37:289-295. PMID: 40327354. DOI: 10.1097/mop.0000000000001452.

Abstract
PURPOSE OF REVIEW Increased incorporation of artificial intelligence in medicine has raised questions regarding how it can enhance efficiency while providing accurate medical information without violating patient privacy. Pediatricians should understand the impact of AI in terms of both their daily practice and the changing landscape of the medical field. RECENT FINDINGS Computer vision models and large language models have been designed for diagnostic purposes and for predicting health outcomes, yet many still lack external validity and reliability. Artificial intelligence can also increase efficiency in electronic health record documentation. Despite potential benefits, legal and ethical concerns are raised about patient data that are stored and used by artificial intelligence models. More research is recommended before artificial intelligence is fully implemented into medical practice. SUMMARY Utilizing artificial intelligence in medical practice and medical education as supplemental tools, rather than as replacements for traditional methods, may result in more efficient medical practice and enhanced methods of studying. Yet there needs to be a balance such that overreliance does not result in automatic trust of potential misinformation. Increased oversight and regulation of artificial intelligence in medicine is crucial to ensure legal and ethical approaches that protect patient privacy.
Affiliation(s)
- Grace Liberatore
- Cohen Children's Medical Center, Developmental and Behavioral Pediatrics, Northwell, New Hyde Park, New York, USA

3. Yip R, Sun YJ, Bassuk AG, Mahajan VB. Artificial intelligence's contribution to biomedical literature search: revolutionizing or complicating? PLOS Digital Health 2025; 4:e0000849. PMID: 40354425. PMCID: PMC12068611. DOI: 10.1371/journal.pdig.0000849.

Abstract
There is a growing number of articles about conversational AI (i.e., ChatGPT) for generating scientific literature reviews and summaries. Yet comparative evidence lags behind its wide adoption by many clinicians and researchers. We explored ChatGPT's utility for literature search from an end-user perspective through the lens of clinicians and biomedical researchers. We quantitatively compared basic versions of ChatGPT's utility against conventional search methods such as Google and PubMed. We further tested whether ChatGPT user-support tools (i.e., plugins, web-browsing function, prompt-engineering, and custom-GPTs) could improve its response across four common and practical literature search scenarios: (1) high-interest topics with an abundance of information, (2) niche topics with limited information, (3) scientific hypothesis generation, and (4) newly emerging clinical practice questions. Our results demonstrated that basic ChatGPT functions had limitations in consistency, accuracy, and relevancy. User-support tools showed improvements, but the limitations persisted. Interestingly, each literature search scenario posed different challenges: an abundance of secondary information sources for high-interest topics, and uncompelling literature for new/niche topics. This study tested practical examples highlighting both the potential and the pitfalls of integrating conversational AI into literature search processes, and underscores the necessity for rigorous comparative assessments of AI tools in scientific research.
Affiliation(s)
- Rui Yip
- Molecular Surgery Laboratory, Stanford University, Palo Alto, California, United States of America
- Department of Ophthalmology, Byers Eye Institute, Stanford University, Palo Alto, California, United States of America
- Young Joo Sun
- Molecular Surgery Laboratory, Stanford University, Palo Alto, California, United States of America
- Department of Ophthalmology, Byers Eye Institute, Stanford University, Palo Alto, California, United States of America
- Alexander G. Bassuk
- Department of Pediatrics, University of Iowa, Iowa City, Iowa, United States of America
- Vinit B. Mahajan
- Molecular Surgery Laboratory, Stanford University, Palo Alto, California, United States of America
- Department of Ophthalmology, Byers Eye Institute, Stanford University, Palo Alto, California, United States of America
- Veterans Affairs Palo Alto Health Care System, Palo Alto, California, United States of America

4. Rollano C, Pérez-González JC, Román-González M. Induced citation analysis: Development and application of a measurement instrument in a systematic review on role-playing games. Account Res 2025:1-15. PMID: 40265379. DOI: 10.1080/08989621.2025.2495287.

Abstract
BACKGROUND The pressure to publish and the dynamics of academic publishing have increased the prevalence of paper mills, citation sales, plagiarism, and academic post-truth, posing a risk to academic integrity. OBJECTIVE The aim of this study is to develop and validate the induced citation checklist to measure the risks introduced by key citations. METHODS To develop this tool, key citations were extracted from a systematic review of role-playing games. An inductive and iterative thematic analysis was performed on these citations. The checklist was applied, and the results were summarized in the induced citation graph. RESULTS The final product, the induced citation checklist, contains five categories, and its application identified widespread issues with citation practices. The most common problem was a lack of empirical foundation. Meanwhile, the induced citation graph provides an intuitive summary of the results. CONCLUSION This study highlights the need to consider the biases introduced through citations. In this regard, the induced citation checklist is presented as a valuable tool for improving academic integrity and research practices, and it is simple to apply. The applicability of the checklist extends beyond role-playing games and systematic reviews; therefore, future research should expand its validation across different disciplines.
Affiliation(s)
- Cecilia Rollano
- International Doctoral School, Universidad Nacional de Educacion a Distancia, Madrid, Spain

5. Li R, Wu T. Delving into the Practical Applications and Pitfalls of Large Language Models in Medical Education: Narrative Review. Advances in Medical Education and Practice 2025; 16:625-636. PMID: 40271151. PMCID: PMC12015179. DOI: 10.2147/amep.s497020.

Abstract
Large language models (LLMs) have emerged as valuable tools in medical education, attracting substantial attention in recent years. They offer educators essential support in developing instructional plans, generating interactive materials, and facilitating efficient feedback mechanisms. Furthermore, LLMs enhance students' language acquisition, writing proficiency, and creativity in educational activities. This review aims to examine the practical applications of LLMs in enhancing the educational and academic performance of both teachers and students, providing specific examples to demonstrate their effectiveness. Additionally, we address the inherent challenges associated with LLM implementation and propose viable solutions to optimize their use. Our study lays the groundwork for the broader integration of LLMs in medical education and research, ensuring the highest standards of medical learning and, ultimately, patient safety.
Affiliation(s)
- Rui Li
- Emergency Department, Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, People’s Republic of China
- Tong Wu
- National Clinical Research Center for Obstetrical and Gynecological Diseases, Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, People’s Republic of China
- Key Laboratory of Cancer Invasion and Metastasis, Ministry of Education, Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, People’s Republic of China
- Department of Obstetrics and Gynecology, Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, People’s Republic of China

6. Resnik DB, Hosseini M. The ethics of using artificial intelligence in scientific research: new guidance needed for a new tool. AI and Ethics 2025; 5:1499-1521. PMID: 40337745. PMCID: PMC12057767. DOI: 10.1007/s43681-024-00493-8.

Abstract
Using artificial intelligence (AI) in research offers many important benefits for science and society but also creates novel and complex ethical issues. While these ethical issues do not necessitate changing established ethical norms of science, they require the scientific community to develop new guidance for the appropriate use of AI. In this article, we briefly introduce AI and explain how it can be used in research, examine some of the ethical issues raised when using it, and offer nine recommendations for responsible use, including: (1) Researchers are responsible for identifying, describing, reducing, and controlling AI-related biases and random errors; (2) Researchers should disclose, describe, and explain their use of AI in research, including its limitations, in language that can be understood by non-experts; (3) Researchers should engage with impacted communities, populations, and other stakeholders concerning the use of AI in research to obtain their advice and assistance and address their interests and concerns, such as issues related to bias; (4) Researchers who use synthetic data should (a) indicate which parts of the data are synthetic; (b) clearly label the synthetic data; (c) describe how the data were generated; and (d) explain how and why the data were used; (5) AI systems should not be named as authors, inventors, or copyright holders but their contributions to research should be disclosed and described; (6) Education and mentoring in responsible conduct of research should include discussion of ethical use of AI.
Affiliation(s)
- David B. Resnik
- National Institute of Environmental Health Sciences, Durham, USA
- Mohammad Hosseini
- Department of Preventive Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL, USA
- Galter Health Sciences Library and Learning Center, Northwestern University Feinberg School of Medicine, Chicago, IL, USA

7. Bailey HE, Carter-Templeton H, Peterson GM, Oermann MH, Owens JK. Prevalence of Words and Phrases Associated With Large Language Model-Generated Text in the Nursing Literature. Comput Inform Nurs 2025; 43:e01237. PMID: 39745875. DOI: 10.1097/cin.0000000000001237.

Abstract
All disciplines, including nursing, may be experiencing significant changes with the advent of free, publicly available generative artificial intelligence tools. Recent research has shown the difficulty of distinguishing artificial intelligence-generated text from content that is written by humans, thereby increasing the probability that unverified information is shared in scholarly works. The purpose of this study was to determine the extent of generative artificial intelligence usage in published nursing articles. The Dimensions database was used to collect articles with at least one appearance of words and phrases associated with generative artificial intelligence. These articles were then searched for words or phrases known to be disproportionately associated with large language model-based generative artificial intelligence. Several nouns, verbs, adverbs, and phrases had remarkable increases in appearance starting in 2023, suggesting use of generative artificial intelligence. Nurses, authors, reviewers, and editors will likely encounter generative artificial intelligence in their work. Although these sophisticated and emerging tools are promising, we must continue to work toward developing ways to verify the accuracy of their content, develop policies that insist on transparent use, and safeguard consumers of the evidence they generate.
Affiliation(s)
- Hannah E Bailey
- Author Affiliations: Data Driven WV, John Chambers College of Business and Economics (Ms Bailey), and School of Nursing, West Virginia University (Dr Carter-Templeton), Morgantown; School of Library and Information Sciences, North Carolina Central University, Durham (Dr Peterson); Duke University School of Nursing, Durham, NC (Dr Oermann); Dwight Schar College of Nursing and Health Sciences, Ashland University, OH (Dr Owens)

8. Yan C, Li Z, Liang Y, Shao S, Ma F, Zhang N, Li B, Wang C, Zhou K. Assessing large language models as assistive tools in medical consultations for Kawasaki disease. Front Artif Intell 2025; 8:1571503. PMID: 40231209. PMCID: PMC11994668. DOI: 10.3389/frai.2025.1571503.

Abstract
Background Kawasaki disease (KD) presents complex clinical challenges in diagnosis, treatment, and long-term management, requiring a comprehensive understanding by both parents and healthcare providers. With advancements in artificial intelligence (AI), large language models (LLMs) have shown promise in supporting medical practice. This study aims to evaluate and compare the appropriateness and comprehensibility of different LLMs in answering clinically relevant questions about KD and assess the impact of different prompting strategies. Methods Twenty-five questions were formulated, incorporating three prompting strategies: No prompting (NO), Parent-friendly (PF), and Doctor-level (DL). These questions were input into three LLMs: ChatGPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro. Responses were evaluated based on appropriateness, educational quality, comprehensibility, cautionary statements, references, and potential misinformation, using Information Quality Grade, Global Quality Scale (GQS), Flesch Reading Ease (FRE) score, and word count. Results Significant differences were found among the LLMs in terms of response educational quality, accuracy, and comprehensibility (p < 0.001). Claude 3.5 provided the highest proportion of completely correct responses (51.1%) and achieved the highest median GQS score (5.0), outperforming GPT-4o (4.0) and Gemini 1.5 (3.0) significantly. Gemini 1.5 achieved the highest FRE score (31.5) and provided highest proportion of responses assessed as comprehensible (80.4%). Prompting strategies significantly affected LLM responses. Claude 3.5 Sonnet with DL prompting had the highest completely correct rate (81.3%), while PF prompting yielded the most acceptable responses (97.3%). Gemini 1.5 Pro showed minimal variation across prompts but excelled in comprehensibility (98.7% under PF prompting). Conclusion This study indicates that LLMs have great potential in providing information about KD, but their use requires caution due to quality inconsistencies and misinformation risks. Significant discrepancies existed across LLMs and prompting strategies. Claude 3.5 Sonnet offered the best response quality and accuracy, while Gemini 1.5 Pro excelled in comprehensibility. PF prompting with Claude 3.5 Sonnet is most recommended for parents seeking KD information. As AI evolves, expanding research and refining models is crucial to ensure reliable, high-quality information.
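The comprehensibility comparison in this entry relies on the Flesch Reading Ease (FRE) score. For readers unfamiliar with the metric, a minimal Python sketch of the standard FRE formula follows; the syllable counter is a crude heuristic assumed here for illustration and is not the instrument used by the study's authors.

```python
import re

def count_syllables(word: str) -> int:
    # Crude heuristic: count runs of consecutive vowels; every word gets at least one syllable.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text: str) -> float:
    # Standard FRE formula: 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words).
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    word_count = max(1, len(words))
    syllable_count = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (word_count / sentences) - 84.6 * (syllable_count / word_count)

# Example: longer sentences and more polysyllabic words push the score down.
print(round(flesch_reading_ease("Kawasaki disease is an acute vasculitis that mainly affects young children."), 1))
```

On the standard interpretation bands, scores in the low 30s, as reported here, fall in the difficult, college-level range, which is why a separate comprehensibility rating is also useful.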
Affiliation(s)
- Chunyi Yan
- Department of Pediatric Cardiology, West China Second University Hospital, Sichuan University, Chengdu, China
- Key Laboratory of Birth Defects and Related Diseases of Women and Children (Sichuan University), Ministry of Education, Chengdu, China
- Zexi Li
- Department of Cardiology, West China Hospital, Sichuan University, Chengdu, China
- Yongzhou Liang
- Department of Pediatric Cardiology, West China Second University Hospital, Sichuan University, Chengdu, China
- Key Laboratory of Birth Defects and Related Diseases of Women and Children (Sichuan University), Ministry of Education, Chengdu, China
- Shuran Shao
- Department of Pediatric Cardiology, West China Second University Hospital, Sichuan University, Chengdu, China
- Key Laboratory of Birth Defects and Related Diseases of Women and Children (Sichuan University), Ministry of Education, Chengdu, China
- Fan Ma
- Department of Pediatric Cardiology, West China Second University Hospital, Sichuan University, Chengdu, China
- Key Laboratory of Birth Defects and Related Diseases of Women and Children (Sichuan University), Ministry of Education, Chengdu, China
- Nanjun Zhang
- Department of Pediatric Cardiology, West China Second University Hospital, Sichuan University, Chengdu, China
- Key Laboratory of Birth Defects and Related Diseases of Women and Children (Sichuan University), Ministry of Education, Chengdu, China
- Bowen Li
- Department of Pediatric Cardiology, West China Second University Hospital, Sichuan University, Chengdu, China
- Key Laboratory of Birth Defects and Related Diseases of Women and Children (Sichuan University), Ministry of Education, Chengdu, China
- Chuan Wang
- Department of Pediatric Cardiology, West China Second University Hospital, Sichuan University, Chengdu, China
- Key Laboratory of Birth Defects and Related Diseases of Women and Children (Sichuan University), Ministry of Education, Chengdu, China
- Kaiyu Zhou
- Department of Pediatric Cardiology, West China Second University Hospital, Sichuan University, Chengdu, China
- Key Laboratory of Birth Defects and Related Diseases of Women and Children (Sichuan University), Ministry of Education, Chengdu, China

9. Ostrovsky AM. Evaluating a large language model's accuracy in chest X-ray interpretation for acute thoracic conditions. Am J Emerg Med 2025; 93:99-102. PMID: 40174466. DOI: 10.1016/j.ajem.2025.03.060.

Abstract
BACKGROUND The rapid advancement of artificial intelligence (AI) has great potential to impact healthcare. Chest X-rays are essential for diagnosing acute thoracic conditions in the emergency department (ED), but interpretation delays due to radiologist availability can impact clinical decision-making. AI models, including deep learning algorithms, have been explored for diagnostic support, but the potential of large language models (LLMs) in emergency radiology remains largely unexamined. METHODS This study assessed ChatGPT's feasibility in interpreting chest X-rays for acute thoracic conditions commonly encountered in the ED. A subset of 1400 images from the NIH Chest X-ray dataset was analyzed, representing seven pathology categories: Atelectasis, Effusion, Emphysema, Pneumothorax, Pneumonia, Mass, and No Finding. ChatGPT 4.0, utilizing the "X-Ray Interpreter" add-on, was evaluated for its diagnostic performance across these categories. RESULTS ChatGPT demonstrated high performance in identifying normal chest X-rays, with a sensitivity of 98.9%, specificity of 93.9%, and accuracy of 94.7%. However, the model's performance varied across pathologies. The best results were observed in diagnosing pneumonia (sensitivity 76.2%, specificity 93.7%) and pneumothorax (sensitivity 77.4%, specificity 89.1%), while performance for atelectasis and emphysema was lower. CONCLUSION ChatGPT demonstrates potential as a supplementary tool for differentiating normal from abnormal chest X-rays, with promising results for certain pathologies like pneumonia. However, its diagnostic accuracy for more subtle conditions requires improvement. Further research integrating ChatGPT with specialized image recognition models could enhance its performance, offering new possibilities in medical imaging and education.
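For readers less familiar with the diagnostic metrics quoted in this entry, the following minimal Python sketch shows how sensitivity, specificity, and accuracy are derived from confusion-matrix counts; the counts in the example are hypothetical and are not taken from the study.

```python
def diagnostic_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard diagnostic-test metrics from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),            # true-positive rate
        "specificity": tn / (tn + fp),            # true-negative rate
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }

# Hypothetical counts for a single pathology category (not the study's data).
print(diagnostic_metrics(tp=152, fp=19, tn=281, fn=48))
```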
Affiliation(s)
- Adam M Ostrovsky
- Sidney Kimmel Medical College at Thomas Jefferson University, Philadelphia, PA, USA.

10. Luo Z, Qiao Y, Xu X, Li X, Xiao M, Kang A, Wang D, Pang Y, Xie X, Xie S, Luo D, Ding X, Liu Z, Liu Y, Hu A, Ren Y, Xie J. Cross sectional pilot study on clinical review generation using large language models. NPJ Digit Med 2025; 8:170. PMID: 40108444. PMCID: PMC11923074. DOI: 10.1038/s41746-025-01535-z.

Abstract
As the volume of medical literature grows rapidly, necessitating efficient tools to synthesize evidence for clinical practice and research, interest in leveraging large language models (LLMs) for generating clinical reviews has surged. However, there are significant concerns regarding the reliability associated with integrating LLMs into the clinical review process. This study presents a systematic comparison between LLM-generated and human-authored clinical reviews, revealing that while AI can quickly produce reviews, it often has fewer references, less comprehensive insights, and lower logical consistency while exhibiting lower authenticity and accuracy in its citations. Additionally, a higher proportion of its references are from lower-tier journals. Moreover, the study uncovers a concerning inefficiency in current detection systems for identifying AI-generated content, suggesting a need for more advanced checking systems and a stronger ethical framework to ensure academic transparency. Addressing these challenges is vital for the responsible integration of LLMs into clinical research.
Affiliation(s)
- Zining Luo
- Department of Gastrointestinal Surgery, Affiliated Hospital of North Sichuan Medical College, Nanchong, Sichuan, China
- School of Electronics & Electrical Engineering, University of Glasgow, Glasgow, Scotland, UK
- School of Information & Communication Engineering, University of Electronic Science & Technology, Chengdu, Sichuan, China
- School of Basic Medicine and School of Forensic Medicine, North Sichuan Medical College, Nanchong, Sichuan, China
- Department of Stomatology, North Sichuan Medical College, Nanchong, Sichuan, China
- Yang Qiao
- Department of Biomedical Engineering, North Sichuan Medical College, Nanchong, Sichuan, China
- Xinyu Xu
- Department of Stomatology, North Sichuan Medical College, Nanchong, Sichuan, China
- Xiangyu Li
- Department of Stomatology, North Sichuan Medical College, Nanchong, Sichuan, China
- Mengyan Xiao
- Department of Stomatology, North Sichuan Medical College, Nanchong, Sichuan, China
- Aijia Kang
- Department of Anesthesia, North Sichuan Medical College, Nanchong, Sichuan, China
- Dunrui Wang
- School of Electronics & Electrical Engineering, University of Glasgow, Glasgow, Scotland, UK
- School of Information & Communication Engineering, University of Electronic Science & Technology, Chengdu, Sichuan, China
- Yueshan Pang
- Department of Geriatrics, The Second Clinical Medical College of North Sichuan Medical College, Nanchong Central Hospital, Nanchong, Sichuan, China
- Xing Xie
- Department of Gastrointestinal Surgery, Affiliated Hospital of North Sichuan Medical College, Nanchong, Sichuan, China
- Department of General Surgery, and Institute of Hepato-Biliary-Pancreas and Intestinal Disease, Affiliated Hospital of North Sichuan Medical College, Nanchong, Sichuan, China
- Sijun Xie
- Department of Gastrointestinal Surgery, Affiliated Hospital of North Sichuan Medical College, Nanchong, Sichuan, China
- Department of General Surgery, and Institute of Hepato-Biliary-Pancreas and Intestinal Disease, Affiliated Hospital of North Sichuan Medical College, Nanchong, Sichuan, China
- Dachen Luo
- Department of Respiratory and Critical Care Medicine, Affiliated Hospital of North Sichuan Medical College, Nanchong, Sichuan, China
- Xuefeng Ding
- Department of Critical Care Medicine, Affiliated Hospital of North Sichuan Medical College, Nanchong, Sichuan, China
- Zhenglong Liu
- School of Basic Medicine and School of Forensic Medicine, North Sichuan Medical College, Nanchong, Sichuan, China
- Ying Liu
- Department of Stomatology, North Sichuan Medical College, Nanchong, Sichuan, China
- Aimin Hu
- Department of Foreign Languages and Culture, North Sichuan Medical College, Nanchong, Sichuan, China
- Yixing Ren
- Department of Gastrointestinal Surgery, Affiliated Hospital of North Sichuan Medical College, Nanchong, Sichuan, China.
- Department of General Surgery, and Institute of Hepato-Biliary-Pancreas and Intestinal Disease, Affiliated Hospital of North Sichuan Medical College, Nanchong, Sichuan, China.
- Jiebin Xie
- Department of Gastrointestinal Surgery, Affiliated Hospital of North Sichuan Medical College, Nanchong, Sichuan, China.
- Department of General Surgery, and Institute of Hepato-Biliary-Pancreas and Intestinal Disease, Affiliated Hospital of North Sichuan Medical College, Nanchong, Sichuan, China.

11. Khabaz K, Newman-Hung NJ, Kallini JR, Kendal J, Christ AB, Bernthal NM, Wessel LE. Assessment of Artificial Intelligence Chatbot Responses to Common Patient Questions on Bone Sarcoma. J Surg Oncol 2025; 131:719-724. PMID: 39470681. PMCID: PMC12065442. DOI: 10.1002/jso.27966.

Abstract
BACKGROUND AND OBJECTIVES The potential impacts of artificial intelligence (AI) chatbots on care for patients with bone sarcoma is poorly understood. Elucidating potential risks and benefits would allow surgeons to define appropriate roles for these tools in clinical care. METHODS Eleven questions on bone sarcoma diagnosis, treatment, and recovery were inputted into three AI chatbots. Answers were assessed on a 5-point Likert scale for five clinical accuracy metrics: relevance to the question, balance and lack of bias, basis on established data, factual accuracy, and completeness in scope. Responses were quantitatively assessed for empathy and readability. The Patient Education Materials Assessment Tool (PEMAT) was assessed for understandability and actionability. RESULTS Chatbots scored highly on relevance (4.24) and balance/lack of bias (4.09) but lower on basing responses on established data (3.77), completeness (3.68), and factual accuracy (3.66). Responses generally scored well on understandability (84.30%), while actionability scores were low for questions on treatment (64.58%) and recovery (60.64%). GPT-4 exhibited the highest empathy (4.12). Readability scores averaged between 10.28 for diagnosis questions to 11.65 for recovery questions. CONCLUSIONS While AI chatbots are promising tools, current limitations in factual accuracy and completeness, as well as concerns of inaccessibility to populations with lower health literacy, may significantly limit their clinical utility.
Affiliation(s)
- Kameel Khabaz
- David Geffen School of Medicine at UCLA, Los Angeles, California, USA
- Jennifer R. Kallini
- Department of Orthopaedic Surgery, University of California, Los Angeles, California, USA
- Joseph Kendal
- Department of Surgery, University of Calgary, Calgary, Alberta, Canada
- Alexander B. Christ
- Department of Orthopaedic Surgery, University of California, Los Angeles, California, USA
- Nicholas M. Bernthal
- Department of Orthopaedic Surgery, University of California, Los Angeles, California, USA
- Lauren E. Wessel
- Department of Orthopaedic Surgery, University of California, Los Angeles, California, USA

12. On SW, Cho SW, Park SY, Ha JW, Yi SM, Park IY, Byun SH, Yang BE. Chat Generative Pre-Trained Transformer (ChatGPT) in Oral and Maxillofacial Surgery: A Narrative Review on Its Research Applications and Limitations. J Clin Med 2025; 14:1363. PMID: 40004892. PMCID: PMC11856154. DOI: 10.3390/jcm14041363.

Abstract
Objectives: This review aimed to evaluate the role of ChatGPT in original research articles within the field of oral and maxillofacial surgery (OMS), focusing on its applications, limitations, and future directions. Methods: A literature search was conducted in PubMed using predefined search terms and Boolean operators to identify original research articles utilizing ChatGPT published up to October 2024. The selection process involved screening studies based on their relevance to OMS and ChatGPT applications, with 26 articles meeting the final inclusion criteria. Results: ChatGPT has been applied in various OMS-related domains, including clinical decision support in real and virtual scenarios, patient and practitioner education, scientific writing and referencing, and its ability to answer licensing exam questions. As a clinical decision support tool, ChatGPT demonstrated moderate accuracy (approximately 70-80%). It showed moderate to high accuracy (up to 90%) in providing patient guidance and information. However, its reliability remains inconsistent across different applications, necessitating further evaluation. Conclusions: While ChatGPT presents potential benefits in OMS, particularly in supporting clinical decisions and improving access to medical information, it should not be regarded as a substitute for clinicians and must be used as an adjunct tool. Further validation studies and technological refinements are required to enhance its reliability and effectiveness in clinical and research settings.
Affiliation(s)
- Sung-Woon On
- Division of Oral and Maxillofacial Surgery, Department of Dentistry, Dongtan Sacred Heart Hospital, Hallym University College of Medicine, Hwaseong 18450, Republic of Korea; (S.-W.O.); (J.-W.H.)
- Department of Artificial Intelligence and Robotics in Dentistry, Graduated School of Clinical Dentistry, Hallym University, Chuncheon 24252, Republic of Korea; (S.-W.C.); (S.-Y.P.); (S.-M.Y.); (I.-Y.P.); (S.-H.B.)
- Institute of Clinical Dentistry, Hallym University, Chuncheon 24252, Republic of Korea
- Seoung-Won Cho
- Department of Artificial Intelligence and Robotics in Dentistry, Graduated School of Clinical Dentistry, Hallym University, Chuncheon 24252, Republic of Korea; (S.-W.C.); (S.-Y.P.); (S.-M.Y.); (I.-Y.P.); (S.-H.B.)
- Institute of Clinical Dentistry, Hallym University, Chuncheon 24252, Republic of Korea
- Sang-Yoon Park
- Department of Artificial Intelligence and Robotics in Dentistry, Graduated School of Clinical Dentistry, Hallym University, Chuncheon 24252, Republic of Korea; (S.-W.C.); (S.-Y.P.); (S.-M.Y.); (I.-Y.P.); (S.-H.B.)
- Institute of Clinical Dentistry, Hallym University, Chuncheon 24252, Republic of Korea
- Department of Oral and Maxillofacial Surgery, Hallym University Sacred Heart Hospital, Anyang 14066, Republic of Korea
- Dental Artificial Intelligence and Robotics R&D Center, Hallym University Medical Center, Anyang 14066, Republic of Korea
- Ji-Won Ha
- Division of Oral and Maxillofacial Surgery, Department of Dentistry, Dongtan Sacred Heart Hospital, Hallym University College of Medicine, Hwaseong 18450, Republic of Korea; (S.-W.O.); (J.-W.H.)
- Department of Artificial Intelligence and Robotics in Dentistry, Graduated School of Clinical Dentistry, Hallym University, Chuncheon 24252, Republic of Korea; (S.-W.C.); (S.-Y.P.); (S.-M.Y.); (I.-Y.P.); (S.-H.B.)
- Institute of Clinical Dentistry, Hallym University, Chuncheon 24252, Republic of Korea
- Sang-Min Yi
- Department of Artificial Intelligence and Robotics in Dentistry, Graduated School of Clinical Dentistry, Hallym University, Chuncheon 24252, Republic of Korea; (S.-W.C.); (S.-Y.P.); (S.-M.Y.); (I.-Y.P.); (S.-H.B.)
- Institute of Clinical Dentistry, Hallym University, Chuncheon 24252, Republic of Korea
- Department of Oral and Maxillofacial Surgery, Hallym University Sacred Heart Hospital, Anyang 14066, Republic of Korea
- Dental Artificial Intelligence and Robotics R&D Center, Hallym University Medical Center, Anyang 14066, Republic of Korea
- In-Young Park
- Department of Artificial Intelligence and Robotics in Dentistry, Graduated School of Clinical Dentistry, Hallym University, Chuncheon 24252, Republic of Korea; (S.-W.C.); (S.-Y.P.); (S.-M.Y.); (I.-Y.P.); (S.-H.B.)
- Institute of Clinical Dentistry, Hallym University, Chuncheon 24252, Republic of Korea
- Dental Artificial Intelligence and Robotics R&D Center, Hallym University Medical Center, Anyang 14066, Republic of Korea
- Department of Orthodontics, Hallym University Sacred Heart Hospital, Anyang 14066, Republic of Korea
- Soo-Hwan Byun
- Department of Artificial Intelligence and Robotics in Dentistry, Graduated School of Clinical Dentistry, Hallym University, Chuncheon 24252, Republic of Korea; (S.-W.C.); (S.-Y.P.); (S.-M.Y.); (I.-Y.P.); (S.-H.B.)
- Institute of Clinical Dentistry, Hallym University, Chuncheon 24252, Republic of Korea
- Department of Oral and Maxillofacial Surgery, Hallym University Sacred Heart Hospital, Anyang 14066, Republic of Korea
- Dental Artificial Intelligence and Robotics R&D Center, Hallym University Medical Center, Anyang 14066, Republic of Korea
- Byoung-Eun Yang
- Department of Artificial Intelligence and Robotics in Dentistry, Graduated School of Clinical Dentistry, Hallym University, Chuncheon 24252, Republic of Korea; (S.-W.C.); (S.-Y.P.); (S.-M.Y.); (I.-Y.P.); (S.-H.B.)
- Institute of Clinical Dentistry, Hallym University, Chuncheon 24252, Republic of Korea
- Department of Oral and Maxillofacial Surgery, Hallym University Sacred Heart Hospital, Anyang 14066, Republic of Korea
- Dental Artificial Intelligence and Robotics R&D Center, Hallym University Medical Center, Anyang 14066, Republic of Korea

13. Binz M, Alaniz S, Roskies A, Aczel B, Bergstrom CT, Allen C, Schad D, Wulff D, West JD, Zhang Q, Shiffrin RM, Gershman SJ, Popov V, Bender EM, Marelli M, Botvinick MM, Akata Z, Schulz E. How should the advancement of large language models affect the practice of science? Proc Natl Acad Sci U S A 2025; 122:e2401227121. PMID: 39869798. PMCID: PMC11804466. DOI: 10.1073/pnas.2401227121.

Abstract
Large language models (LLMs) are being increasingly incorporated into scientific workflows. However, we have yet to fully grasp the implications of this integration. How should the advancement of large language models affect the practice of science? For this opinion piece, we have invited four diverse groups of scientists to reflect on this query, sharing their perspectives and engaging in debate. Schulz et al. make the argument that working with LLMs is not fundamentally different from working with human collaborators, while Bender et al. argue that LLMs are often misused and overhyped, and that their limitations warrant a focus on more specialized, easily interpretable tools. Marelli et al. emphasize the importance of transparent attribution and responsible use of LLMs. Finally, Botvinick and Gershman advocate that humans should retain responsibility for determining the scientific roadmap. To facilitate the discussion, the four perspectives are complemented with a response from each group. By putting these different perspectives in conversation, we aim to bring attention to important considerations within the academic community regarding the adoption of LLMs and their impact on both current and future scientific practices.
Affiliation(s)
- Marcel Binz
- Max Planck Institute for Biological Cybernetics, Tübingen, Baden-Württemberg 72076, Germany
- Helmholtz Center for Computational Health, Munich, Oberschleißheim, Bayern 85764, Germany
- Stephan Alaniz
- Helmholtz Center for Computational Health, Munich, Oberschleißheim, Bayern 85764, Germany
- Department of Computer Science, Technical University of Munich, München, Bayern 80333, Germany
- Munich Center for Machine Learning, München, Bayern 80333, Germany
- Adina Roskies
- Department of Psychological and Brain Sciences, University of California, Santa Barbara, CA 93106
- Balazs Aczel
- Institute of Psychology, Eötvös Loránd University, Budapest 1053, Hungary
- Carl T. Bergstrom
- Department of Linguistics, University of Washington, Seattle, WA 98195
- Colin Allen
- Department of Psychological and Brain Sciences, University of California, Santa Barbara, CA 93106
- Daniel Schad
- Psychology Department and Institute of Mind, Brain and Behavior, Health and Medical University, Potsdam, Brandenburg 14471, Germany
- Dirk Wulff
- Max Planck Institute for Human Development, Berlin 14195, Germany
- Center for Cognitive and Decision Science, University of Basel, Basel 4001, Switzerland
- Jevin D. West
- Department of Linguistics, University of Washington, Seattle, WA 98195
- Qiong Zhang
- Department of Psychology, Rutgers University, New Brunswick, NJ 08901
- Richard M. Shiffrin
- Department of Psychological and Brain Sciences, Indiana University, Bloomington, IN 47408
- Vencislav Popov
- Department of Psychology, University of Zurich, Zurich 8006, Switzerland
- Emily M. Bender
- Department of Linguistics, University of Washington, Seattle, WA 98195
- Marco Marelli
- Department of Psychology, University of Milano-Bicocca, Milano 20126, Italy
- Matthew M. Botvinick
- Google DeepMind, London N1C 4AG, United Kingdom
- Gatsby Computational Neuroscience Unit, University College London, London WC1E 6BT, United Kingdom
- Zeynep Akata
- Helmholtz Center for Computational Health, Munich, Oberschleißheim, Bayern 85764, Germany
- Department of Computer Science, Technical University of Munich, München, Bayern 80333, Germany
- Munich Center for Machine Learning, München, Bayern 80333, Germany
- Eric Schulz
- Max Planck Institute for Biological Cybernetics, Tübingen, Baden-Württemberg 72076, Germany
- Helmholtz Center for Computational Health, Munich, Oberschleißheim, Bayern 85764, Germany

14. Zhu K, Zhang J, Klishin A, Esser M, Blumentals WA, Juhaeri J, Jouquelet-Royer C, Sinnott S. Evaluating the Accuracy of Responses by Large Language Models for Information on Disease Epidemiology. Pharmacoepidemiol Drug Saf 2025; 34:e70111. PMID: 39901360. PMCID: PMC11791122. DOI: 10.1002/pds.70111.

Abstract
PURPOSE Accurate background epidemiology of diseases is required in pharmacoepidemiologic research. We evaluated the performance of large language models (LLMs), including ChatGPT-3.5, ChatGPT-4, and Google Bard, when prompted with questions on disease frequency. METHODS A total of 21 questions on the prevalence and incidence of common and rare diseases were developed and submitted to each LLM twice on different dates. Benchmark data were obtained from literature searches targeting "gold-standard" references (e.g., government statistics, peer-reviewed articles). Accuracy was evaluated by comparing LLMs' responses to the benchmark data. Consistency was determined by comparing the responses to the same query submitted on different dates. The relevance and authenticity of references were evaluated. RESULTS The three LLMs generated 126 responses. In ChatGPT-4, 76.2% of responses were accurate, higher than 50.0% in Bard and 45.2% in ChatGPT-3.5. ChatGPT-4 exhibited higher consistency (71.4%) than Bard (57.9%) or ChatGPT-3.5 (46.7%). ChatGPT-4 provided 52 references, of which 27 (51.9%) provided relevant information, and all were authentic. Only 9.2% (10/109) of references from Bard were relevant. Of 65/109 unique references, 67.7% were authentic, 7.7% provided insufficient information for access, 10.8% provided inaccurate citations, and 13.8% were non-existent/fabricated. ChatGPT-3.5 did not provide any references. CONCLUSIONS ChatGPT-4 outperformed Bard and ChatGPT-3.5 in retrieving information on disease epidemiology. However, all three LLMs presented inaccurate responses, including irrelevant, incomplete, or fabricated references. Such limitations preclude the utility of the current forms of LLMs in obtaining accurate disease epidemiology by researchers in the pharmaceutical industry, in academia, or in the regulatory setting.
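As a hedged illustration of the kind of scoring described in this entry, the sketch below tallies accuracy against benchmark values and consistency across two submission dates; the 20% tolerance rule and the example records are assumptions made for illustration, not the authors' protocol.

```python
from dataclasses import dataclass

@dataclass
class Response:
    question: str
    run1: float       # estimate returned on the first submission date
    run2: float       # estimate returned when the same query was resubmitted later
    benchmark: float  # gold-standard value from the literature

def close(a: float, b: float, rel_tol: float = 0.2) -> bool:
    # Assumed rule: values agree if they differ by no more than rel_tol of the larger magnitude.
    return abs(a - b) <= rel_tol * max(abs(a), abs(b), 1e-12)

def summarize(responses) -> dict:
    n = len(responses)
    accurate = sum(close(r.run1, r.benchmark) for r in responses)
    consistent = sum(close(r.run1, r.run2) for r in responses)
    return {"accuracy": accurate / n, "consistency": consistent / n}

# Hypothetical records (not the study's data): prevalence per 100,000 population.
sample = [
    Response("Prevalence of condition A", run1=120.0, run2=118.0, benchmark=110.0),
    Response("Prevalence of condition B", run1=4.0, run2=9.0, benchmark=8.5),
]
print(summarize(sample))
```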
Affiliation(s)
- Kexin Zhu
- Epidemiology and Benefit Risk, Sanofi, Bridgewater, New Jersey, USA
- Mario Esser
- Global Pharmacovigilance, Sanofi, Frankfurt, Germany
- Juhaeri Juhaeri
- Epidemiology and Benefit Risk, Sanofi, Bridgewater, New Jersey, USA

15. Koirala P, Thongprayoon C, Miao J, Garcia Valencia OA, Sheikh MS, Suppadungsuk S, Mao MA, Pham JH, Craici IM, Cheungpasitporn W. Evaluating AI performance in nephrology triage and subspecialty referrals. Sci Rep 2025; 15:3455. PMID: 39870788. PMCID: PMC11772766. DOI: 10.1038/s41598-025-88074-5.

Abstract
Artificial intelligence (AI) has shown promise in revolutionizing medical triage, particularly in the context of the rising prevalence of kidney-related conditions with the aging global population. This study evaluates the utility of ChatGPT, a large language model, in triaging nephrology cases through simulated real-world scenarios. Two nephrologists created 100 patient cases that encompassed various aspects of nephrology. ChatGPT's performance in determining the appropriateness of nephrology consultations and identifying suitable nephrology subspecialties was assessed. The results demonstrated high accuracy; ChatGPT correctly determined the need for nephrology in 99-100% of cases, and it accurately identified the most suitable nephrology subspecialty triage in 96-99% of cases across two evaluation rounds. The agreement between the two rounds was 97%. While ChatGPT showed promise in improving medical triage efficiency and accuracy, the study also identified areas for refinement. This included the need for better integration of multidisciplinary care for patients with complex, intersecting medical conditions. This study's findings highlight the potential of AI in enhancing decision-making processes in clinical workflow, and it can inform the development of AI-assisted triage systems tailored to institution-specific practices including multidisciplinary approaches.
Affiliation(s)
- Charat Thongprayoon
- Division of Nephrology and Hypertension, Mayo Clinic, 200 First Street SW, Rochester, MN, 55905, USA
- Jing Miao
- Division of Nephrology and Hypertension, Mayo Clinic, 200 First Street SW, Rochester, MN, 55905, USA
- Oscar A Garcia Valencia
- Division of Nephrology and Hypertension, Mayo Clinic, 200 First Street SW, Rochester, MN, 55905, USA
- Mohammad S Sheikh
- Division of Nephrology and Hypertension, Mayo Clinic, 200 First Street SW, Rochester, MN, 55905, USA
- Supawadee Suppadungsuk
- Division of Nephrology and Hypertension, Mayo Clinic, 200 First Street SW, Rochester, MN, 55905, USA
- Faculty of Medicine Ramathibodi Hospital, Chakri Naruebodindra Medical Institute, Mahidol University, Samut Prakan, 10540, Thailand
- Michael A Mao
- Division of Nephrology and Hypertension, Department of Medicine, Mayo Clinic, Jacksonville, FL, 32224, USA
- Justin H Pham
- Internal Medicine, Mayo Clinic, Rochester, MN, 55905, USA
- Iasmina M Craici
- Division of Nephrology and Hypertension, Mayo Clinic, 200 First Street SW, Rochester, MN, 55905, USA
- Wisit Cheungpasitporn
- Division of Nephrology and Hypertension, Mayo Clinic, 200 First Street SW, Rochester, MN, 55905, USA.

16. Hu X, Xu D, Zhang H, Tang M, Gao Q. Comparative diagnostic accuracy of ChatGPT-4 and machine learning in differentiating spinal tuberculosis and spinal tumors. Spine J 2025:S1529-9430(25)00015-4. PMID: 39805470. DOI: 10.1016/j.spinee.2024.12.035.

Abstract
BACKGROUND In clinical practice, distinguishing between spinal tuberculosis (STB) and spinal tumors (ST) poses a significant diagnostic challenge. The application of AI-driven large language models (LLMs) shows great potential for improving the accuracy of this differential diagnosis. PURPOSE To evaluate the performance of various machine learning models and ChatGPT-4 in distinguishing between STB and ST. STUDY DESIGN A retrospective cohort study. PATIENT SAMPLE 143 STB cases and 153 ST cases admitted to Xiangya Hospital Central South University, from January 2016 to June 2023 were collected. OUTCOME MEASURES This study incorporates basic patient information, standard laboratory results, serum tumor markers, and comprehensive imaging records, including Magnetic Resonance Imaging (MRI) and Computed Tomography (CT), for individuals diagnosed with STB and ST. Machine learning techniques and ChatGPT-4 were utilized to distinguish between STB and ST separately. METHOD Six distinct machine learning models, along with ChatGPT-4, were employed to evaluate their differential diagnostic effectiveness. RESULT Among the 6 machine learning models, the Gradient Boosting Machine (GBM) algorithm model demonstrated the highest differential diagnostic efficiency. In the training cohort, the GBM model achieved a sensitivity of 98.84% and a specificity of 100.00% in distinguishing STB from ST. In the testing cohort, its sensitivity was 98.25%, and specificity was 91.80%. ChatGPT-4 exhibited a sensitivity of 70.37% and a specificity of 90.65% for differential diagnosis. In single-question cases, ChatGPT-4's sensitivity and specificity were 71.67% and 92.55%, respectively, while in re-questioning cases, they were 44.44% and 76.92%. CONCLUSION The GBM model demonstrates significant value in the differential diagnosis of STB and ST, whereas the diagnostic performance of ChatGPT-4 remains suboptimal.
Affiliation(s)
- Xiaojiang Hu
- Department of Spine Surgery and Orthopaedics, Xiangya Hospital, Central South University, Changsha 410008, China; Department of Orthopedics, The Second Xiangya Hospital of Central South University, Changsha, 410011, Hunan, China
- Dongcheng Xu
- Department of Spine Surgery and Orthopaedics, Xiangya Hospital, Central South University, Changsha 410008, China; Department of Spine Surgery, The Third Xiangya Hospital, Central South University, Changsha, Hunan, 410013, China
- Hongqi Zhang
- Department of Spine Surgery and Orthopaedics, Xiangya Hospital, Central South University, Changsha 410008, China; National Clinical Research Center for Geriatric Disorders, Xiangya Hospital, Central South University, Changsha 410008, China
- Mingxing Tang
- Department of Spine Surgery and Orthopaedics, Xiangya Hospital, Central South University, Changsha 410008, China; National Clinical Research Center for Geriatric Disorders, Xiangya Hospital, Central South University, Changsha 410008, China.
- Qile Gao
- Department of Spine Surgery and Orthopaedics, Xiangya Hospital, Central South University, Changsha 410008, China; National Clinical Research Center for Geriatric Disorders, Xiangya Hospital, Central South University, Changsha 410008, China.

17. Danesh A, Danesh F, Danesh A. ChatGPT's risk of misinformation in dentistry: A comparative follow-up evaluation. J Am Dent Assoc 2025; 156:3-5. PMID: 38878024. DOI: 10.1016/j.adaj.2024.05.003.

18. Porto JR, Morgan KA, Hecht CJ, Burkhart RJ, Liu RW. Quantifying the Scope of Artificial Intelligence-Assisted Writing in Orthopaedic Medical Literature: An Analysis of Prevalence and Validation of AI-Detection Software. J Am Acad Orthop Surg 2025; 33:42-50. PMID: 39602700. DOI: 10.5435/jaaos-d-24-00084.

Abstract
INTRODUCTION The popularization of generative artificial intelligence (AI), including Chat Generative Pre-trained Transformer (ChatGPT), has raised concerns for the integrity of academic literature. This study asked the following questions: (1) Has the popularization of publicly available generative AI, such as ChatGPT, increased the prevalence of AI-generated orthopaedic literature? (2) Can AI detectors accurately identify ChatGPT-generated text? (3) Are there associations between article characteristics and the likelihood that it was AI generated? METHODS PubMed was searched across six major orthopaedic journals to identify articles received for publication after January 1, 2023. Two hundred and forty articles were randomly selected and entered into three popular AI detectors. Twenty articles published by each journal before the release of ChatGPT were randomly selected as negative control articles. 36 positive control articles (6 per journal) were created by altering 25%, 50%, and 100% of text from negative control articles using ChatGPT and were then used to validate each detector. The mean percentage of text detected as written by AI per detector was compared between pre-ChatGPT and post-ChatGPT release articles using independent t -test. Multivariate regression analysis was conducted using percentage AI-generated text per journal, article type (ie, cohort, clinical trial, review), and month of submission. RESULTS One AI detector consistently and accurately identified AI-generated text in positive control articles, whereas two others showed poor sensitivity and specificity. The most accurate detector showed a modest increase in the percentage AI detected for the articles received post release of ChatGPT (+1.8%, P = 0.01). Regression analysis showed no consistent associations between likelihood of AI-generated text per journal, article type, or month of submission. CONCLUSIONS As this study found an early, albeit modest, effect of generative AI on the orthopaedic literature, proper oversight will play a critical role in maintaining research integrity and accuracy. AI detectors may play a critical role in regulatory efforts, although they will require further development and standardization to the interpretation of their results.
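The pre-/post-release comparison in this entry uses an independent-samples t-test on the mean percentage of AI-detected text. A minimal Python sketch of that comparison follows; the per-article scores are hypothetical placeholders, not the study's data.

```python
from scipy import stats

# Hypothetical per-article "percent of text flagged as AI-generated" scores (not the study's data).
pre_chatgpt_release = [1.0, 0.5, 2.1, 0.0, 1.3, 0.8, 1.9, 0.2]
post_chatgpt_release = [2.4, 3.1, 1.8, 4.0, 2.9, 1.5, 3.3, 2.2]

# Two-sided independent-samples t-test comparing the group means.
t_stat, p_value = stats.ttest_ind(post_chatgpt_release, pre_chatgpt_release)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```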
Affiliation(s)
- Joshua R Porto
- From the Department of Orthopaedic Surgery, University Hospitals of Cleveland, Case Western Reserve University, Cleveland, OH (Porto, Morgan, Hecht, Burkhart, and Liu), and the Case Western Reserve University School of Medicine, Cleveland, OH (Porto, Morgan, and Hecht)

19. Kramer RSS. Face to face: Comparing ChatGPT with human performance on face matching. Perception 2025; 54:65-68. PMID: 39497555. DOI: 10.1177/03010066241295992.

Abstract
ChatGPT's large language model, GPT-4V, has been trained on vast numbers of image-text pairs and is therefore capable of processing visual input. This model operates very differently from current state-of-the-art neural networks designed specifically for face perception and so I chose to investigate whether ChatGPT could also be applied to this domain. With this aim, I focussed on the task of face matching, that is, deciding whether two photographs showed the same person or not. Across six different tests, ChatGPT demonstrated performance that was comparable with human accuracies despite being a domain-general 'virtual assistant' rather than a specialised tool for face processing. This perhaps surprising result identifies a new avenue for exploration in this field, while further research should explore the boundaries of ChatGPT's ability, along with how its errors may relate to those made by humans.
20
Dayan R, Uliel B, Koplewitz G. Age against the machine-susceptibility of large language models to cognitive impairment: cross sectional analysis. BMJ 2024; 387:e081948. [PMID: 39706600 DOI: 10.1136/bmj-2024-081948] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Indexed: 12/23/2024]
Abstract
OBJECTIVE To evaluate the cognitive abilities of the leading large language models and identify their susceptibility to cognitive impairment, using the Montreal Cognitive Assessment (MoCA) and additional tests. DESIGN Cross sectional analysis. SETTING Online interaction with large language models via text based prompts. PARTICIPANTS Publicly available large language models, or "chatbots": ChatGPT versions 4 and 4o (developed by OpenAI), Claude 3.5 "Sonnet" (developed by Anthropic), and Gemini versions 1 and 1.5 (developed by Alphabet). ASSESSMENTS The MoCA test (version 8.1) was administered to the leading large language models with instructions identical to those given to human patients. Scoring followed official guidelines and was evaluated by a practising neurologist. Additional assessments included the Navon figure, cookie theft picture, Poppelreuter figure, and Stroop test. MAIN OUTCOME MEASURES MoCA scores, performance in visuospatial/executive tasks, and Stroop test results. RESULTS ChatGPT 4o achieved the highest score on the MoCA test (26/30), followed by ChatGPT 4 and Claude (25/30), with Gemini 1.0 scoring lowest (16/30). All large language models showed poor performance in visuospatial/executive tasks. Gemini models failed at the delayed recall task. Only ChatGPT 4o succeeded in the incongruent stage of the Stroop test. CONCLUSIONS With the exception of ChatGPT 4o, almost all large language models subjected to the MoCA test showed signs of mild cognitive impairment. Moreover, as in humans, age is a key determinant of cognitive decline: "older" chatbots, like older patients, tend to perform worse on the MoCA test. These findings challenge the assumption that artificial intelligence will soon replace human doctors, as the cognitive impairment evident in leading chatbots may affect their reliability in medical diagnostics and undermine patients' confidence.
Affiliation(s)
- Roy Dayan
- Department of Neurology, Hadassah Medical Center, Jerusalem, Israel
- Faculty of Medicine, Hebrew University, Jerusalem, Israel
- Benjamin Uliel
- Department of Neurology, Hadassah Medical Center, Jerusalem, Israel
- Faculty of Medicine, Hebrew University, Jerusalem, Israel
- Gal Koplewitz
- QuantumBlack Analytics, London, UK
- Faculty of Medicine, Tel Aviv University, Tel Aviv, Israel
21
Anbumani S, Ahunbay E. Toward the Clinically Effective Evaluation of Artificial Intelligence-Generated Responses. JCO Clin Cancer Inform 2024; 8:e2400258. [PMID: 39661915 DOI: 10.1200/cci-24-00258] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/14/2024] [Accepted: 10/18/2024] [Indexed: 12/13/2024] Open
Affiliation(s)
- Ergun Ahunbay
- Department of Radiation Oncology, Medical College of Wisconsin, Milwaukee, WI
22
Lee J, Park S, Shin J, Cho B. Analyzing evaluation methods for large language models in the medical field: a scoping review. BMC Med Inform Decis Mak 2024; 24:366. [PMID: 39614219 PMCID: PMC11606129 DOI: 10.1186/s12911-024-02709-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/19/2024] [Accepted: 10/03/2024] [Indexed: 12/01/2024] Open
Abstract
BACKGROUND Owing to the rapid growth in the popularity of Large Language Models (LLMs), various performance evaluation studies have been conducted to confirm their applicability in the medical field. However, there is still no clear framework for evaluating LLMs. OBJECTIVE This study reviews studies on LLM evaluations in the medical field and analyzes the research methods used in these studies. It aims to provide a reference for future researchers designing LLM studies. METHODS & MATERIALS We conducted a scoping review of three databases (PubMed, Embase, and MEDLINE) to identify LLM-related articles published between January 1, 2023, and September 30, 2023. We analyzed the types of methods, number of questions (queries), evaluators, repeat measurements, additional analysis methods, use of prompt engineering, and metrics other than accuracy. RESULTS A total of 142 articles met the inclusion criteria. LLM evaluation was primarily categorized as either providing test examinations (n = 53, 37.3%) or being evaluated by a medical professional (n = 80, 56.3%), with some hybrid cases (n = 5, 3.5%) or a combination of the two (n = 4, 2.8%). Most studies had 100 or fewer questions (n = 18, 29.0%), 15 (24.2%) performed repeated measurements, 18 (29.0%) performed additional analyses, and 8 (12.9%) used prompt engineering. For medical assessment, most studies used 50 or fewer queries (n = 54, 64.3%), had two evaluators (n = 43, 48.3%), and 14 (14.7%) used prompt engineering. CONCLUSIONS More research is required regarding the application of LLMs in healthcare. Although previous studies have evaluated performance, future studies will likely focus on improving performance. A well-structured methodology is required for these studies to be conducted systematically.
Affiliation(s)
- Junbok Lee
- Institute for Innovation in Digital Healthcare, Yonsei University, Seoul, Republic of Korea
- Department of Human Systems Medicine, Seoul National University College of Medicine, Seoul, Republic of Korea
- Sungkyung Park
- Department of Bigdata AI Management Information, Seoul National University of Science and Technology, Seoul, Republic of Korea
- Jaeyong Shin
- Department of Preventive Medicine and Public Health, Yonsei University College of Medicine, 50-1, Yonsei-ro, Seodaemun-gu, Seoul, 03722, Republic of Korea.
- Institute of Health Services Research, Yonsei University College of Medicine, Seoul, Korea.
- Belong Cho
- Department of Human Systems Medicine, Seoul National University College of Medicine, Seoul, Republic of Korea.
- Department of Family Medicine, Seoul National University Hospital, Seoul, Republic of Korea.
- Department of Family Medicine, Seoul National University College of Medicine, 101 Daehak-ro, Jongno-gu, Seoul, 03080, Republic of Korea.
23
Chatterjee S, Fruhling A, Kotiadis K, Gartner D. Towards new frontiers of healthcare systems research using artificial intelligence and generative AI. Health Syst (Basingstoke) 2024; 13:263-273. [PMID: 39584173 PMCID: PMC11580149 DOI: 10.1080/20476965.2024.2402128] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/26/2024] Open
24
Chen A, Qilleri A, Foster T, Rao AS, Gopalakrishnan S, Niezgoda J, Oropallo A. Generative Artificial Intelligence: Applications in Scientific Writing and Data Analysis in Wound Healing Research. Adv Skin Wound Care 2024; 37:601-607. [PMID: 39792511 DOI: 10.1097/asw.0000000000000226] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/12/2025]
Abstract
Generative artificial intelligence (AI) models are a new technological development with vast research use cases among medical subspecialties. These powerful large language models offer a wide range of possibilities in wound care, from personalized patient support to optimized treatment plans and improved scientific writing. They can also assist in efficiently navigating the literature and selecting and summarizing articles, enabling researchers to focus on impactful studies relevant to wound care management and enhancing response quality through prompt-learning iterations. For nonnative English-speaking medical practitioners and authors, generative AI may aid in grammar and vocabulary selection. Although reports have suggested limitations of the conversational agent on medical translation pertaining to the precise interpretation of medical context, when used with verified resources, this language model can breach language barriers and promote practice-changing advancements in global wound care. Further, AI-powered chatbots can enable continuous monitoring of wound healing progress and real-time insights into treatment responses through frequent, readily available remote patient follow-ups. However, implementing AI in wound care research requires careful consideration of potential limitations, especially in accurately translating complex medical terms and workflows. Ethical considerations are vital to ensure reliable and credible wound care research when using AI technologies. Although ChatGPT shows promise for transforming wound care management, the authors warn against overreliance on the technology. Considering the potential limitations and risks, proper validation and oversight are essential to unlock its true potential while ensuring patient safety and the effectiveness of wound care treatments.
Affiliation(s)
- Adrian Chen
- At the Donald and Barbara Zucker School of Medicine at Hofstra/Northwell, Hempstead, New York, United States, Adrian Chen, BS, Aleksandra Qilleri, BS, and Timothy Foster, BS, are Medical Students. Amit S. Rao, MD, is Project Manager, Department of Surgery, Wound Care Division, Northwell Wound Healing Center and Hyperbarics, Northwell Health, Hempstead. Sandeep Gopalakrishnan, PhD, MAPWCA, is Associate Professor and Director, Wound Healing and Tissue Repair Analytics Laboratory, School of Nursing, College of Health Professions, University of Wisconsin-Milwaukee. Jeffrey Niezgoda, MD, MAPWCA, is Founder and President Emeritus, AZH Wound Care and Hyperbaric Oxygen Therapy Center, Milwaukee, and President and Chief Medical Officer, WebCME, Greendale, Wisconsin. Alisha Oropallo, MD, is Professor of Surgery, Donald and Barbara Zucker School of Medicine and The Feinstein Institutes for Medical Research, Manhasset New York; Director, Comprehensive Wound Healing Center, Northwell Health; and Program Director, Wound and Burn Fellowship program, Northwell Health
25
Oermann MH. You Cannot Search the Literature Using Artificial Intelligence, and This Is Why. Nurs Educ Perspect 2024; 45:337. [PMID: 39400193 DOI: 10.1097/01.nep.0000000000001344] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/15/2024]
Affiliation(s)
- Marilyn H Oermann
- Marilyn H. Oermann, PhD, RN, ANEF, FAAN, editor-in-chief of Nurse Educator, is Thelma M. Ingles Professor of Nursing, Duke University School of Nursing, Durham, North Carolina.
26
Mastrokostas PG, Mastrokostas LE, Emara AK, Wellington IJ, Ginalis E, Houten JK, Khalsa AS, Saleh A, Razi AE, Ng MK. GPT-4 as a Source of Patient Information for Anterior Cervical Discectomy and Fusion: A Comparative Analysis Against Google Web Search. Global Spine J 2024; 14:2389-2398. [PMID: 38513636 PMCID: PMC11529100 DOI: 10.1177/21925682241241241] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Indexed: 03/23/2024] Open
Abstract
STUDY DESIGN Comparative study. OBJECTIVES This study aims to compare Google and GPT-4 in terms of (1) question types, (2) response readability, (3) source quality, and (4) numerical response accuracy for the top 10 most frequently asked questions (FAQs) about anterior cervical discectomy and fusion (ACDF). METHODS "Anterior cervical discectomy and fusion" was searched on Google and GPT-4 on December 18, 2023. Top 10 FAQs were classified according to the Rothwell system. Source quality was evaluated using JAMA benchmark criteria and readability was assessed using Flesch Reading Ease and Flesch-Kincaid grade level. Differences in JAMA scores, Flesch-Kincaid grade level, Flesch Reading Ease, and word count between platforms were analyzed using Student's t-tests. Statistical significance was set at the .05 level. RESULTS Frequently asked questions from Google were varied, while GPT-4 focused on technical details and indications/management. GPT-4 showed a higher Flesch-Kincaid grade level (12.96 vs 9.28, P = .003), lower Flesch Reading Ease score (37.07 vs 54.85, P = .005), and higher JAMA scores for source quality (3.333 vs 1.800, P = .016). Numerically, 6 out of 10 responses varied between platforms, with GPT-4 providing broader recovery timelines for ACDF. CONCLUSIONS This study demonstrates GPT-4's ability to elevate patient education by providing high-quality, diverse information tailored to those with advanced literacy levels. As AI technology evolves, refining these tools for accuracy and user-friendliness remains crucial, catering to patients' varying literacy levels and information needs in spine surgery.
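The two readability measures compared above follow standard published formulas, shown in the sketch below. This is not the study's code: the vowel-group syllable counter is only a rough approximation, and published analyses normally use a validated calculator.

```python
# Standard Flesch formulas (published constants); syllable counting is approximated.
import re

def count_syllables(word):
    # crude heuristic: one syllable per group of consecutive vowels, minimum 1
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability(text):
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    w, s = len(words), sentences
    flesch_reading_ease = 206.835 - 1.015 * (w / s) - 84.6 * (syllables / w)
    flesch_kincaid_grade = 0.39 * (w / s) + 11.8 * (syllables / w) - 15.59
    return flesch_reading_ease, flesch_kincaid_grade

fre, fkgl = readability("Anterior cervical discectomy and fusion removes a damaged disc in the neck.")
print(f"Flesch Reading Ease: {fre:.1f}, Flesch-Kincaid Grade Level: {fkgl:.1f}")
```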
Affiliation(s)
- Paul G. Mastrokostas
- College of Medicine, State University of New York (SUNY) Downstate, Brooklyn, NY, USA
- Ahmed K. Emara
- Department of Orthopaedic Surgery, Cleveland Clinic, Cleveland, OH, USA
- Ian J. Wellington
- Department of Orthopaedic Surgery, University of Connecticut, Hartford, CT, USA
- John K. Houten
- Department of Neurosurgery, Mount Sinai School of Medicine, New York, NY, USA
- Amrit S. Khalsa
- Department of Orthopaedic Surgery, University of Pennsylvania, Philadelphia, PA, USA
- Ahmed Saleh
- Department of Orthopaedic Surgery, Maimonides Medical Center, Brooklyn, NY, USA
- Afshin E. Razi
- Department of Orthopaedic Surgery, Maimonides Medical Center, Brooklyn, NY, USA
- Mitchell K. Ng
- Department of Orthopaedic Surgery, Maimonides Medical Center, Brooklyn, NY, USA
27
Rothka AJ, Lorenz FJ, Hearn M, Meci A, LaBarge B, Walen SG, Slonimsky G, McGinn J, Chung T, Goyal N. Utilizing Artificial Intelligence to Increase the Readability of Patient Education Materials in Pediatric Otolaryngology. EAR, NOSE & THROAT JOURNAL 2024:1455613241289647. [PMID: 39467826 DOI: 10.1177/01455613241289647] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/30/2024] Open
Abstract
Objectives: To identify the reading levels of existing patient education materials in pediatric otolaryngology and to utilize natural language processing artificial intelligence (AI) to reduce the reading level of patient education materials. Methods: Patient education materials for pediatric conditions were identified from the American Academy of Otolaryngology-Head and Neck Surgery (AAO-HNS) website. Patient education materials about the same conditions, if available, were identified and selected from the websites of 7 children's hospitals. The readability of the patient materials was scored before and after using AI with the Flesch-Kincaid calculator. ChatGPT version 3.5 was used to convert the materials to a fifth-grade reading level. Results: On average, AAO-HNS pediatric education material was written at a 10.71 ± 0.71 grade level. After requesting the reduction of those materials to a fifth-grade reading level, ChatGPT converted the same materials to an average grade level of 7.9 ± 1.18 (P < .01). When comparing the published materials from AAO-HNS and the 7 institutions, the average grade level was 9.32 ± 1.82, and ChatGPT was able to reduce the average level to 7.68 ± 1.12 (P = .0598). Of the 7 children's hospitals, only 1 institution had an average grade level below the recommended sixth-grade level. Conclusions: Patient education materials in pediatric otolaryngology were consistently above recommended reading levels. In its current state, AI can reduce the reading levels of education materials. However, it did not possess the capability to reduce all materials to be below the recommended reading level.
Affiliation(s)
- F Jeffrey Lorenz
- Penn State Health Department of Otolaryngology-Head and Neck Surgery, Hershey, PA, USA
- Andrew Meci
- Penn State College of Medicine, Hershey, PA, USA
- Brandon LaBarge
- Penn State Health Department of Otolaryngology-Head and Neck Surgery, Hershey, PA, USA
- Scott G Walen
- Penn State Health Department of Otolaryngology-Head and Neck Surgery, Hershey, PA, USA
- Guy Slonimsky
- Penn State Health Department of Otolaryngology-Head and Neck Surgery, Hershey, PA, USA
- Johnathan McGinn
- Penn State Health Department of Otolaryngology-Head and Neck Surgery, Hershey, PA, USA
- Thomas Chung
- Penn State Health Department of Otolaryngology-Head and Neck Surgery, Hershey, PA, USA
- Neerav Goyal
- Penn State Health Department of Otolaryngology-Head and Neck Surgery, Hershey, PA, USA
28
Ghanta SN, Al’Aref SJ, Lala-Trinidade A, Nadkarni GN, Ganatra S, Dani SS, Mehta JL. Applications of ChatGPT in Heart Failure Prevention, Diagnosis, Management, and Research: A Narrative Review. Diagnostics (Basel) 2024; 14:2393. [PMID: 39518361 PMCID: PMC11544991 DOI: 10.3390/diagnostics14212393] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/22/2024] [Revised: 10/22/2024] [Accepted: 10/24/2024] [Indexed: 11/16/2024] Open
Abstract
Heart failure (HF) is a leading cause of mortality, morbidity, and financial burden worldwide. The emergence of advanced artificial intelligence (AI) technologies, particularly Generative Pre-trained Transformer (GPT) systems, presents new opportunities to enhance HF management. In this review, we identified and examined existing studies on the use of ChatGPT in HF care by searching multiple medical databases (PubMed, Google Scholar, Medline, and Scopus). We assessed the role of ChatGPT in HF prevention, diagnosis, and management, focusing on its influence on clinical decision-making and patient education. However, ChatGPT faces limited training data, inherent biases, and ethical issues that hinder its widespread clinical adoption. We review these limitations and highlight the need for improved training approaches, greater model transparency, and robust regulatory compliance. Additionally, we explore the effectiveness of ChatGPT in managing HF, particularly in reducing hospital readmissions and improving patient outcomes with customized treatment plans while addressing social determinants of health (SDoH). In this review, we aim to provide healthcare professionals and policymakers with an in-depth understanding of ChatGPT's potential and constraints within the realm of HF care.
Affiliation(s)
- Sai Nikhila Ghanta
- Department of Internal Medicine, University of Arkansas for Medical Sciences, Little Rock, AR 72205, USA;
- Subhi J. Al’Aref
- Division of Cardiology, University of Arkansas for Medical Sciences, Little Rock, AR 72205, USA;
- Anuradha Lala-Trinidade
- Division of Cardiology, Ichan School of Medicine at Mount Sinai, New York, NY 10029, USA; (A.L.-T.); (G.N.N.)
- Girish N. Nadkarni
- Division of Cardiology, Ichan School of Medicine at Mount Sinai, New York, NY 10029, USA; (A.L.-T.); (G.N.N.)
- Sarju Ganatra
- Division of Cardiology, Lahey Hospital and Medical Center, Burlington, MA 01805, USA;
- Sourbha S. Dani
- Division of Cardiology, Lahey Hospital and Medical Center, Burlington, MA 01805, USA;
- Jawahar L. Mehta
- Division of Cardiology, University of Arkansas for Medical Sciences, Little Rock, AR 72205, USA;
29
Kikuchi T, Nakao T, Nakamura Y, Hanaoka S, Mori H, Yoshikawa T. Toward Improved Radiologic Diagnostics: Investigating the Utility and Limitations of GPT-3.5 Turbo and GPT-4 with Quiz Cases. AJNR Am J Neuroradiol 2024; 45:1506-1511. [PMID: 38719605 PMCID: PMC11448975 DOI: 10.3174/ajnr.a8332] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/26/2023] [Accepted: 05/03/2024] [Indexed: 08/17/2024]
Abstract
BACKGROUND AND PURPOSE The rise of large language models such as generative pretrained transformers (GPTs) has sparked considerable interest in radiology, especially in interpreting radiologic reports and image findings. While existing research has focused on GPTs estimating diagnoses from radiologic descriptions, exploring alternative diagnostic information sources is also crucial. This study introduces the use of GPTs (GPT-3.5 Turbo and GPT-4) for information retrieval and summarization, searching relevant case reports via PubMed, and investigates their potential to aid diagnosis. MATERIALS AND METHODS From October 2021 to December 2023, we selected 115 cases from the "Case of the Week" series on the American Journal of Neuroradiology website. Their Description and Legend sections were presented to the GPTs for the 2 tasks. For the Direct Diagnosis task, the models provided 3 differential diagnoses that were considered correct if they matched the diagnosis in the diagnosis section. For the Case Report Search task, the models generated 2 keywords per case, creating PubMed search queries to extract up to 3 relevant reports. A response was considered correct if reports containing the disease name stated in the diagnosis section were extracted. The McNemar test was used to evaluate whether adding a Case Report Search to Direct Diagnosis improved overall accuracy. RESULTS In the Direct Diagnosis task, GPT-3.5 Turbo achieved a correct response rate of 26% (30/115 cases), whereas GPT-4 achieved 41% (47/115). For the Case Report Search task, GPT-3.5 Turbo scored 10% (11/115), and GPT-4 scored 7% (8/115). Correct responses totaled 32% (37/115) with 3 overlapping cases for GPT-3.5 Turbo, whereas GPT-4 had 43% (50/115) of correct responses with 5 overlapping cases. Adding Case Report Search improved GPT-3.5 Turbo's performance (P = .023) but not that of GPT-4 (P = .248). CONCLUSIONS The effectiveness of adding Case Report Search to GPT-3.5 Turbo was particularly pronounced, suggesting its potential as an alternative diagnostic approach to GPTs, particularly in scenarios where direct diagnoses from GPTs are not obtainable. Nevertheless, the overall performance of GPT models in both direct diagnosis and case report retrieval tasks remains not optimal, and users should be aware of their limitations.
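Because each case is scored twice (direct diagnosis alone, and with the added case-report search), the improvement test above is a paired McNemar comparison. The sketch below shows that test on an invented 2 × 2 table using statsmodels (an assumed dependency); the counts are illustrative, not the study's data.

```python
# Hedged sketch of a McNemar test on paired per-case outcomes (invented counts).
from statsmodels.stats.contingency_tables import mcnemar

# rows: correct by direct diagnosis (yes / no); columns: correct once search is added (yes / no)
table = [[28, 2],    # correct both ways / correct only without the search
         [9, 76]]    # correct only after adding the search / correct neither way

result = mcnemar(table, exact=True)   # exact binomial test on the discordant pairs
print(f"statistic = {result.statistic}, p = {result.pvalue:.4f}")
```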
Affiliation(s)
- Tomohiro Kikuchi
- From the Departments of Computational Diagnostic Radiology and Preventive Medicine (T.K., T.N., Y.N., T.Y.), The University of Tokyo Hospital, Tokyo, Japan
- Department of Radiology (T.K., H.M.), School of Medicine, Jichi Medical University, Shimotsuke, Tochigi, Japan
- Takahiro Nakao
- From the Departments of Computational Diagnostic Radiology and Preventive Medicine (T.K., T.N., Y.N., T.Y.), The University of Tokyo Hospital, Tokyo, Japan
- Yuta Nakamura
- From the Departments of Computational Diagnostic Radiology and Preventive Medicine (T.K., T.N., Y.N., T.Y.), The University of Tokyo Hospital, Tokyo, Japan
- Shouhei Hanaoka
- Departments of Radiology (S.H.), The University of Tokyo Hospital, Tokyo, Japan
- Harushi Mori
- Department of Radiology (T.K., H.M.), School of Medicine, Jichi Medical University, Shimotsuke, Tochigi, Japan
- Takeharu Yoshikawa
- From the Departments of Computational Diagnostic Radiology and Preventive Medicine (T.K., T.N., Y.N., T.Y.), The University of Tokyo Hospital, Tokyo, Japan
30
Cai Y, Zhao R, Zhao H, Li Y, Gou L. Exploring the use of ChatGPT/GPT-4 for patient follow-up after oral surgeries. Int J Oral Maxillofac Surg 2024; 53:867-872. [PMID: 38664106 DOI: 10.1016/j.ijom.2024.04.002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/16/2024] [Revised: 03/17/2024] [Accepted: 04/10/2024] [Indexed: 08/27/2024]
Abstract
Since 2023, ChatGPT has been leading a research boom in large language models. Research on the applications of large language models in various fields is also being explored. The aim of this study was to explore the use of ChatGPT/GPT-4 for post-surgery patient follow-up after oral surgery. Thirty questions that are the most commonly asked or may be encountered during follow-up and in daily practice were collected to test ChatGPT/GPT-4's responses. A standard prompt was used for each question. The responses given by ChatGPT/GPT-4 were evaluated by three experienced oral and maxillofacial surgeons to assess the suitability of this technology for clinical follow-up, based on the accuracy of medical knowledge and rationality of the advice in ChatGPT/GPT-4's responses. ChatGPT/GPT-4 achieved full marks in terms of both the accuracy of its medical knowledge and the rationality of its recommendations. Additionally, ChatGPT/GPT-4 was able to accurately sense patient emotions and provide them with reassurance. In conclusion, ChatGPT/GPT-4 could be used for patient follow-up after oral surgeries, but this should be done with careful consideration of the technology's current limitations and under the guidance of healthcare professionals.
Affiliation(s)
- Y Cai
- Chongqing Key Laboratory of Oral Diseases and Biomedical Sciences, The Stomatological Hospital of Chongqing Medical University, Chongqing, China; Department of Oral and Maxillofacial Surgery, The Stomatological Hospital of Chongqing Medical University, Chongqing, China; Chongqing Municipal Key Laboratory of Oral Biomedical Engineering of Higher Education, Chongqing, China
- R Zhao
- Department of Oral and Maxillofacial Surgery, The Stomatological Hospital of Chongqing Medical University, Chongqing, China
- H Zhao
- Chongqing Key Laboratory of Oral Diseases and Biomedical Sciences, The Stomatological Hospital of Chongqing Medical University, Chongqing, China; Department of Oral and Maxillofacial Surgery, The Stomatological Hospital of Chongqing Medical University, Chongqing, China; Chongqing Municipal Key Laboratory of Oral Biomedical Engineering of Higher Education, Chongqing, China
- Y Li
- Chongqing Key Laboratory of Oral Diseases and Biomedical Sciences, The Stomatological Hospital of Chongqing Medical University, Chongqing, China; Department of Oral and Maxillofacial Surgery, The Stomatological Hospital of Chongqing Medical University, Chongqing, China
- L Gou
- Chongqing Key Laboratory of Oral Diseases and Biomedical Sciences, The Stomatological Hospital of Chongqing Medical University, Chongqing, China; Department of Oral and Maxillofacial Surgery, The Stomatological Hospital of Chongqing Medical University, Chongqing, China; Chongqing Municipal Key Laboratory of Oral Biomedical Engineering of Higher Education, Chongqing, China.
31
Halawani A, Almehmadi SG, Alhubaishy BA, Alnefaie ZA, Hasan MN. Empowering patients: how accurate and readable are large language models in renal cancer education. Front Oncol 2024; 14:1457516. [PMID: 39391252 PMCID: PMC11464325 DOI: 10.3389/fonc.2024.1457516] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/30/2024] [Accepted: 09/09/2024] [Indexed: 10/12/2024] Open
Abstract
Background The incorporation of Artificial Intelligence (AI) into the healthcare sector has fundamentally transformed patient care paradigms, particularly through the creation of patient education materials (PEMs) tailored to individual needs. This study aims to assess the precision and readability of AI-generated information on kidney cancer using ChatGPT 4.0, Gemini AI, and Perplexity AI, comparing these outputs to PEMs provided by the American Urological Association (AUA) and the European Association of Urology (EAU). The objective is to guide physicians in directing patients to accurate and understandable resources. Methods PEMs published by the AUA and EAU were collected and categorized. Kidney cancer-related queries, identified via Google Trends (GT), were input into ChatGPT-4.0, Gemini AI, and Perplexity AI. Four independent reviewers assessed the AI outputs for accuracy across five distinct categories, employing a 5-point Likert scale. A readability evaluation was conducted utilizing established formulas, including the Gunning Fog Index (GFI), Simple Measure of Gobbledygook (SMOG), and Flesch-Kincaid Grade Level (FKGL). AI chatbots were then tasked with simplifying their outputs to achieve a sixth-grade reading level. Results The PEM published by the AUA was the most readable, with a mean readability score of 9.84 ± 1.2, in contrast to the EAU (11.88 ± 1.11), ChatGPT-4.0 (11.03 ± 1.76), Perplexity AI (12.66 ± 1.83), and Gemini AI (10.83 ± 2.31). The chatbots demonstrated the capability to simplify text to lower grade levels upon request, with ChatGPT-4.0 achieving a readability grade level ranging from 5.76 to 9.19, Perplexity AI from 7.33 to 8.45, and Gemini AI from 6.43 to 8.43. While official PEMs were considered accurate, the LLM-generated outputs exhibited an overall high level of accuracy, with minor detail omissions and some information inaccuracies. Information related to kidney cancer treatment was found to be the least accurate among the evaluated categories. Conclusion Although the PEM published by the AUA was the most readable, both authoritative PEMs and large language model (LLM)-generated outputs exceeded the recommended readability threshold for the general population. AI chatbots can simplify their outputs when explicitly instructed. However, notwithstanding their accuracy, LLM-generated outputs are susceptible to detail omissions and inaccuracies. The variability in AI performance necessitates cautious use as an adjunctive tool in patient education.
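Readability scoring of chatbot outputs, as described above, is easy to script. The sketch below uses the third-party textstat package as one possible implementation; the package choice and the placeholder texts are assumptions, not the authors' pipeline.

```python
# Hedged sketch: batch readability metrics for chatbot outputs (placeholder text).
import textstat

outputs = {
    "ChatGPT-4.0": "Kidney cancer is often treated with surgery to remove the tumor or the whole kidney.",
    "Gemini AI": "Treatment for renal cell carcinoma depends on the stage and spread of the disease.",
}

for source, text in outputs.items():
    print(source,
          "FKGL:", textstat.flesch_kincaid_grade(text),
          "SMOG:", textstat.smog_index(text),
          "Gunning Fog:", textstat.gunning_fog(text))
```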
Affiliation(s)
- Ziyad A. Alnefaie
- Department of Urology, King Abdulaziz University, Jeddah, Saudi Arabia
- Mudhar N. Hasan
- Mohammed Bin Rashid University of Medicine and Health Sciences, Dubai, United Arab Emirates
- Department of Urology, Mediclinic City Hospital, Dubai, United Arab Emirates
32
Filetti S, Fenza G, Gallo A. Research design and writing of scholarly articles: new artificial intelligence tools available for researchers. Endocrine 2024; 85:1104-1116. [PMID: 39085566 DOI: 10.1007/s12020-024-03977-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 05/28/2024] [Accepted: 07/22/2024] [Indexed: 08/02/2024]
33
Hersh W. Search still matters: information retrieval in the era of generative AI. J Am Med Inform Assoc 2024; 31:2159-2161. [PMID: 38287655 PMCID: PMC11339511 DOI: 10.1093/jamia/ocae014] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/29/2023] [Revised: 12/19/2023] [Accepted: 01/15/2024] [Indexed: 01/31/2024] Open
Abstract
OBJECTIVE Information retrieval (IR, also known as search) systems are ubiquitous in modern times. How does the emergence of generative artificial intelligence (AI), based on large language models (LLMs), fit into the IR process? PROCESS This perspective explores the use of generative AI in the context of the motivations, considerations, and outcomes of the IR process with a focus on the academic use of such systems. CONCLUSIONS There are many information needs, from simple to complex, that motivate use of IR. Users of such systems, particularly academics, have concerns for authoritativeness, timeliness, and contextualization of search. While LLMs may provide functionality that aids the IR process, the continued need for search systems, and research into their improvement, remains essential.
Affiliation(s)
- William Hersh
- Department of Medical Informatics & Clinical Epidemiology, School of Medicine, Oregon Health & Science University, Portland, OR 97239, United States
34
Lehr SA, Caliskan A, Liyanage S, Banaji MR. ChatGPT as Research Scientist: Probing GPT's capabilities as a Research Librarian, Research Ethicist, Data Generator, and Data Predictor. Proc Natl Acad Sci U S A 2024; 121:e2404328121. [PMID: 39163339 PMCID: PMC11363351 DOI: 10.1073/pnas.2404328121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/01/2024] [Accepted: 07/01/2024] [Indexed: 08/22/2024] Open
Abstract
How good a research scientist is ChatGPT? We systematically probed the capabilities of GPT-3.5 and GPT-4 across four central components of the scientific process: as a Research Librarian, Research Ethicist, Data Generator, and Novel Data Predictor, using psychological science as a testing field. In Study 1 (Research Librarian), unlike human researchers, GPT-3.5 and GPT-4 hallucinated, authoritatively generating fictional references 36.0% and 5.4% of the time, respectively, although GPT-4 exhibited an evolving capacity to acknowledge its fictions. In Study 2 (Research Ethicist), GPT-4 (though not GPT-3.5) proved capable of detecting violations like p-hacking in fictional research protocols, correcting 88.6% of blatantly presented issues, and 72.6% of subtly presented issues. In Study 3 (Data Generator), both models consistently replicated patterns of cultural bias previously discovered in large language corpora, indicating that ChatGPT can simulate known results, an antecedent to usefulness for both data generation and skills like hypothesis generation. Contrastingly, in Study 4 (Novel Data Predictor), neither model was successful at predicting new results absent in their training data, and neither appeared to leverage substantially new information when predicting more vs. less novel outcomes. Together, these results suggest that GPT is a flawed but rapidly improving librarian, a decent research ethicist already, capable of data generation in simple domains with known characteristics but poor at predicting novel patterns of empirical data to aid future experimentation.
Affiliation(s)
- Aylin Caliskan
- Information School, University of Washington, Seattle, WA 98195
35
Bridges JM. Computerized diagnostic decision support systems - a comparative performance study of Isabel Pro vs. ChatGPT4. Diagnosis (Berl) 2024; 11:250-258. [PMID: 38709491 DOI: 10.1515/dx-2024-0033] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/15/2024] [Accepted: 04/22/2024] [Indexed: 05/07/2024]
Abstract
OBJECTIVES Validate the diagnostic accuracy of the Artificial Intelligence Large Language Model ChatGPT4 by comparing diagnosis lists produced by ChatGPT4 to Isabel Pro. METHODS This study used 201 cases, comparing ChatGPT4 to Isabel Pro. Systems inputs were identical. Mean Reciprocal Rank (MRR) compares the correct diagnosis's rank between systems. Isabel Pro ranks by the frequency with which the symptoms appear in the reference dataset. The mechanism ChatGPT4 uses to rank the diagnoses is unknown. A Wilcoxon Signed Rank Sum test failed to reject the null hypothesis. RESULTS Both systems produced comprehensive differential diagnosis lists. Isabel Pro's list appears immediately upon submission, while ChatGPT4 takes several minutes. Isabel Pro produced 175 (87.1 %) correct diagnoses and ChatGPT4 165 (82.1 %). The MRR for ChatGPT4 was 0.428 (rank 2.31), and Isabel Pro was 0.389 (rank 2.57), an average rank of three for each. ChatGPT4 outperformed on Recall at Rank 1, 5, and 10, with Isabel Pro outperforming at 20, 30, and 40. The Wilcoxon Signed Rank Sum Test confirmed that the sample size was inadequate to conclude that the systems are equivalent. ChatGPT4 fabricated citations and DOIs, producing 145 correct references (87.9 %) but only 52 correct DOIs (31.5 %). CONCLUSIONS This study validates the promise of Clinical Diagnostic Decision Support Systems, including the Large Language Model form of artificial intelligence (AI). Until the issue of hallucination of references and, perhaps diagnoses, is resolved in favor of absolute accuracy, clinicians will make cautious use of Large Language Model systems in diagnosis, if at all.
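Mean Reciprocal Rank and Recall at Rank k, the retrieval-style metrics reported above, are straightforward to compute once the rank of the correct diagnosis in each differential list is known. The sketch below is generic and uses hypothetical ranks, not the study's 201 cases.

```python
# Generic MRR and Recall@k over the rank of the correct diagnosis (None = not listed).
def mean_reciprocal_rank(ranks):
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)

def recall_at_k(ranks, k):
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)

hypothetical_ranks = [1, 3, None, 2, 5]                      # five cases, one miss
print(round(mean_reciprocal_rank(hypothetical_ranks), 3))    # 0.407
print(recall_at_k(hypothetical_ranks, 5))                    # 0.8
```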
Affiliation(s)
- Joe M Bridges
- D. Bradley McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, USA
36
Langston E, Charness N, Boot W. Are Virtual Assistants Trustworthy for Medicare Information: An Examination of Accuracy and Reliability. THE GERONTOLOGIST 2024; 64:gnae062. [PMID: 38832398 PMCID: PMC11258897 DOI: 10.1093/geront/gnae062] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/13/2023] [Indexed: 06/05/2024] Open
Abstract
BACKGROUND AND OBJECTIVES Advances in artificial intelligence (AI)-based virtual assistants provide a potential opportunity for older adults to use this technology in the context of health information-seeking. Meta-analysis on trust in AI shows that users are influenced by the accuracy and reliability of the AI trustee. We evaluated these dimensions for responses to Medicare queries. RESEARCH DESIGN AND METHODS During the summer of 2023, we assessed the accuracy and reliability of Alexa, Google Assistant, Bard, and ChatGPT-4 on Medicare terminology and general content from a large, standardized question set. We compared the accuracy of these AI systems to that of a large representative sample of Medicare beneficiaries who were queried twenty years prior. RESULTS Alexa and Google Assistant were found to be highly inaccurate when compared to beneficiaries' mean accuracy of 68.4% on terminology queries and 53.0% on general Medicare content. Bard and ChatGPT-4 answered Medicare terminology queries perfectly and performed much better on general Medicare content queries (Bard = 96.3%, ChatGPT-4 = 92.6%) than the average Medicare beneficiary. About one month to a month-and-a-half later, we found that Bard and Alexa's accuracy stayed the same, whereas ChatGPT-4's performance nominally decreased, and Google Assistant's performance nominally increased. DISCUSSION AND IMPLICATIONS LLM-based assistants generate trustworthy information in response to carefully phrased queries about Medicare, in contrast to Alexa and Google Assistant. Further studies will be needed to determine what factors beyond accuracy and reliability influence the adoption and use of such technology for Medicare decision-making.
Affiliation(s)
- Emily Langston
- Department of Psychology, Florida State University, Tallahassee, Florida, USA
- Neil Charness
- Department of Psychology, Florida State University, Tallahassee, Florida, USA
- Walter Boot
- Department of Psychology, Florida State University, Tallahassee, Florida, USA
37
Vij O, Calver H, Myall N, Dey M, Kouranloo K. Evaluating the competency of ChatGPT in MRCP Part 1 and a systematic literature review of its capabilities in postgraduate medical assessments. PLoS One 2024; 19:e0307372. [PMID: 39083455 PMCID: PMC11290618 DOI: 10.1371/journal.pone.0307372] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/10/2024] [Accepted: 07/03/2024] [Indexed: 08/02/2024] Open
Abstract
OBJECTIVES As a large language model (LLM) trained on a large data set, ChatGPT can perform a wide array of tasks without additional training. We evaluated the performance of ChatGPT on postgraduate UK medical examinations through a systematic literature review of ChatGPT's performance in UK postgraduate medical assessments and its performance on Member of Royal College of Physicians (MRCP) Part 1 examination. METHODS Medline, Embase and Cochrane databases were searched. Articles discussing the performance of ChatGPT in UK postgraduate medical examinations were included in the systematic review. Information was extracted on exam performance including percentage scores and pass/fail rates. MRCP UK Part 1 sample paper questions were inserted into ChatGPT-3.5 and -4 four times each and the scores marked against the correct answers provided. RESULTS 12 studies were ultimately included in the systematic literature review. ChatGPT-3.5 scored 66.4% and ChatGPT-4 scored 84.8% on MRCP Part 1 sample paper, which is 4.4% and 22.8% above the historical pass mark respectively. Both ChatGPT-3.5 and -4 performance was significantly above the historical pass mark for MRCP Part 1, indicating they would likely pass this examination. ChatGPT-3.5 failed eight out of nine postgraduate exams it performed with an average percentage of 5.0% below the pass mark. ChatGPT-4 passed nine out of eleven postgraduate exams it performed with an average percentage of 13.56% above the pass mark. ChatGPT-4 performance was significantly better than ChatGPT-3.5 in all examinations that both models were tested on. CONCLUSION ChatGPT-4 performed at above passing level for the majority of UK postgraduate medical examinations it was tested on. ChatGPT is prone to hallucinations, fabrications and reduced explanation accuracy which could limit its potential as a learning tool. The potential for these errors is an inherent part of LLMs and may always be a limitation for medical applications of ChatGPT.
Affiliation(s)
- Oliver Vij
- Guy’s Hospital, Guy’s and St Thomas’ NHS Foundation Trust, Great Maze Pond, London, United Kingdom
- Nikki Myall
- British Medical Association Library, BMA House, Tavistock Square, London, United Kingdom
- Mrinalini Dey
- Centre for Rheumatic Diseases, Denmark Hill Campus King’s College London, London, United Kingdom
- Koushan Kouranloo
- Department of Rheumatology, University Hospital Lewisham, London, United Kingdom
- School of Medicine, Cedar House, University of Liverpool, Liverpool, United Kingdom
38
Aljamaan F, Temsah MH, Altamimi I, Al-Eyadhy A, Jamal A, Alhasan K, Mesallam TA, Farahat M, Malki KH. Reference Hallucination Score for Medical Artificial Intelligence Chatbots: Development and Usability Study. JMIR Med Inform 2024; 12:e54345. [PMID: 39083799 PMCID: PMC11325115 DOI: 10.2196/54345] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/06/2023] [Revised: 01/05/2024] [Accepted: 07/03/2024] [Indexed: 08/02/2024] Open
Abstract
BACKGROUND Artificial intelligence (AI) chatbots have recently gained use in medical practice by health care practitioners. Interestingly, the output of these AI chatbots was found to have varying degrees of hallucination in content and references. Such hallucinations generate doubts about their output and their implementation. OBJECTIVE The aim of our study was to propose a reference hallucination score (RHS) to evaluate the authenticity of AI chatbots' citations. METHODS Six AI chatbots were challenged with the same 10 medical prompts, requesting 10 references per prompt. The RHS is composed of 6 bibliographic items and the reference's relevance to prompts' keywords. RHS was calculated for each reference, prompt, and type of prompt (basic vs complex). The average RHS was calculated for each AI chatbot and compared across the different types of prompts and AI chatbots. RESULTS Bard failed to generate any references. ChatGPT 3.5 and Bing generated the highest RHS (score=11), while Elicit and SciSpace generated the lowest RHS (score=1), and Perplexity generated a middle RHS (score=7). The highest degree of hallucination was observed for reference relevancy to the prompt keywords (308/500, 61.6%), while the lowest was for reference titles (169/500, 33.8%). ChatGPT and Bing had comparable RHS (β coefficient=-0.069; P=.32), while Perplexity had significantly lower RHS than ChatGPT (β coefficient=-0.345; P<.001). AI chatbots generally had significantly higher RHS when prompted with scenarios or complex format prompts (β coefficient=0.486; P<.001). CONCLUSIONS The variation in RHS underscores the necessity for a robust reference evaluation tool to improve the authenticity of AI chatbots. Further, the variations highlight the importance of verifying their output and citations. Elicit and SciSpace had negligible hallucination, while ChatGPT and Bing had critical hallucination levels. The proposed AI chatbots' RHS could contribute to ongoing efforts to enhance AI's general reliability in medical research.
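The abstract describes the RHS only at a high level (six bibliographic items plus relevance to the prompt keywords), so the sketch below is a purely illustrative operationalization; the field names and one-point weights are guesses, not the authors' published instrument.

```python
# Illustrative (not the published) reference hallucination score: one point per
# bibliographic field that cannot be verified, plus one if the reference is not
# relevant to the prompt keywords; higher = more hallucinated.
BIBLIO_FIELDS = ["authors", "title", "journal", "year", "volume_pages", "doi"]

def reference_hallucination(generated, verified, keyword_relevant):
    score = sum(1 for f in BIBLIO_FIELDS if generated.get(f) != verified.get(f))
    return score + (0 if keyword_relevant else 1)

generated = {"authors": "Smith J", "title": "AI in sepsis care", "journal": "JAMA",
             "year": 2021, "volume_pages": "325:12-19", "doi": "10.1000/fake"}
verified = {"authors": "Smith J", "title": "AI in sepsis care", "journal": "JAMA",
            "year": 2022, "volume_pages": "327:45-52", "doi": "10.1001/real"}
print(reference_hallucination(generated, verified, keyword_relevant=True))  # 3
```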
Affiliation(s)
- Fadi Aljamaan
- College of Medicine, King Saud University, Riyadh, Saudi Arabia
- Ayman Al-Eyadhy
- College of Medicine, King Saud University, Riyadh, Saudi Arabia
- Amr Jamal
- College of Medicine, King Saud University, Riyadh, Saudi Arabia
- Khalid Alhasan
- College of Medicine, King Saud University, Riyadh, Saudi Arabia
- Tamer A Mesallam
- Department of Otolaryngology, College of Medicine, Research Chair of Voice, Swallowing, and Communication Disorders, King Saud University, Riyadh, Saudi Arabia
- Mohamed Farahat
- Department of Otolaryngology, College of Medicine, Research Chair of Voice, Swallowing, and Communication Disorders, King Saud University, Riyadh, Saudi Arabia
- Khalid H Malki
- Department of Otolaryngology, College of Medicine, Research Chair of Voice, Swallowing, and Communication Disorders, King Saud University, Riyadh, Saudi Arabia
39
Gorenz D, Schwarz N. How funny is ChatGPT? A comparison of human- and A.I.-produced jokes. PLoS One 2024; 19:e0305364. [PMID: 38959273 PMCID: PMC11221738 DOI: 10.1371/journal.pone.0305364] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/15/2024] [Accepted: 05/28/2024] [Indexed: 07/05/2024] Open
Abstract
Can a large language model produce humor? Past research has focused on anecdotal examples of large language models succeeding or failing at producing humor. These examples, while interesting, do not examine ChatGPT's humor production abilities in ways comparable to humans' abilities, nor do they shed light on how funny ChatGPT is to the general public. To provide a systematic test, we asked ChatGPT 3.5 and laypeople to respond to the same humor prompts (Study 1). We also asked ChatGPT 3.5 to generate humorous satirical headlines in the style of The Onion and compared them to published headlines of the satirical magazine, written by professional comedy writers (Study 2). In both studies, human participants rated the funniness of the human and A.I.-produced responses without being aware of their source. ChatGPT 3.5-produced jokes were rated as equally funny or funnier than human-produced jokes regardless of the comedic task and the expertise of the human comedy writer.
Affiliation(s)
- Drew Gorenz
- Department of Psychology, University of Southern California, Los Angeles, California, United States of America
- Mind & Society Center, University of Southern California, Los Angeles, California, United States of America
- Norbert Schwarz
- Department of Psychology, University of Southern California, Los Angeles, California, United States of America
- Mind & Society Center, University of Southern California, Los Angeles, California, United States of America
- Marshall School of Business, University of Southern California, Los Angeles, California, United States of America
40
Heinke A, Radgoudarzi N, Huang BB, Baxter SL. A review of ophthalmology education in the era of generative artificial intelligence. Asia Pac J Ophthalmol (Phila) 2024; 13:100089. [PMID: 39134176 PMCID: PMC11934932 DOI: 10.1016/j.apjo.2024.100089] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/16/2024] [Revised: 07/31/2024] [Accepted: 08/02/2024] [Indexed: 08/18/2024] Open
Abstract
PURPOSE To explore the integration of generative AI, specifically large language models (LLMs), in ophthalmology education and practice, addressing their applications, benefits, challenges, and future directions. DESIGN A literature review and analysis of current AI applications and educational programs in ophthalmology. METHODS Analysis of published studies, reviews, articles, websites, and institutional reports on AI use in ophthalmology. Examination of educational programs incorporating AI, including curriculum frameworks, training methodologies, and evaluations of AI performance on medical examinations and clinical case studies. RESULTS Generative AI, particularly LLMs, shows potential to improve diagnostic accuracy and patient care in ophthalmology. Applications include aiding in patient, physician, and medical students' education. However, challenges such as AI hallucinations, biases, lack of interpretability, and outdated training data limit clinical deployment. Studies revealed varying levels of accuracy of LLMs on ophthalmology board exam questions, underscoring the need for more reliable AI integration. Several educational programs nationwide provide AI and data science training relevant to clinical medicine and ophthalmology. CONCLUSIONS Generative AI and LLMs offer promising advancements in ophthalmology education and practice. Addressing challenges through comprehensive curricula that include fundamental AI principles, ethical guidelines, and updated, unbiased training data is crucial. Future directions include developing clinically relevant evaluation metrics, implementing hybrid models with human oversight, leveraging image-rich data, and benchmarking AI performance against ophthalmologists. Robust policies on data privacy, security, and transparency are essential for fostering a safe and ethical environment for AI applications in ophthalmology.
Affiliation(s)
- Anna Heinke
- Division of Ophthalmology Informatics and Data Science, The Viterbi Family Department of Ophthalmology and Shiley Eye Institute, University of California San Diego, 9415 Campus Point Drive, La Jolla, CA 92037, USA; Jacobs Retina Center, 9415 Campus Point Drive, La Jolla, CA 92037, USA
- Niloofar Radgoudarzi
- Division of Ophthalmology Informatics and Data Science, The Viterbi Family Department of Ophthalmology and Shiley Eye Institute, University of California San Diego, 9415 Campus Point Drive, La Jolla, CA 92037, USA; Division of Biomedical Informatics, Department of Medicine, University of California San Diego Health System, University of California San Diego, La Jolla, CA, USA
- Bonnie B Huang
- Division of Ophthalmology Informatics and Data Science, The Viterbi Family Department of Ophthalmology and Shiley Eye Institute, University of California San Diego, 9415 Campus Point Drive, La Jolla, CA 92037, USA; Division of Biomedical Informatics, Department of Medicine, University of California San Diego Health System, University of California San Diego, La Jolla, CA, USA; Northwestern University Feinberg School of Medicine, Chicago, IL, USA
- Sally L Baxter
- Division of Ophthalmology Informatics and Data Science, The Viterbi Family Department of Ophthalmology and Shiley Eye Institute, University of California San Diego, 9415 Campus Point Drive, La Jolla, CA 92037, USA; Division of Biomedical Informatics, Department of Medicine, University of California San Diego Health System, University of California San Diego, La Jolla, CA, USA.
41
Chu CP. ChatGPT in veterinary medicine: a practical guidance of generative artificial intelligence in clinics, education, and research. Front Vet Sci 2024; 11:1395934. [PMID: 38911678 PMCID: PMC11192069 DOI: 10.3389/fvets.2024.1395934] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/04/2024] [Accepted: 05/21/2024] [Indexed: 06/25/2024] Open
Abstract
ChatGPT, the most accessible generative artificial intelligence (AI) tool, offers considerable potential for veterinary medicine, yet a dedicated review of its specific applications is lacking. This review concisely synthesizes the latest research and practical applications of ChatGPT within the clinical, educational, and research domains of veterinary medicine. It intends to provide specific guidance and actionable examples of how generative AI can be directly utilized by veterinary professionals without a programming background. For practitioners, ChatGPT can extract patient data, generate progress notes, and potentially assist in diagnosing complex cases. Veterinary educators can create custom GPTs for student support, while students can utilize ChatGPT for exam preparation. ChatGPT can aid in academic writing tasks in research, but veterinary publishers have set specific requirements for authors to follow. Despite its transformative potential, careful use is essential to avoid pitfalls like hallucination. This review addresses ethical considerations, provides learning resources, and offers tangible examples to guide responsible implementation. A table of key takeaways was provided to summarize this review. By highlighting potential benefits and limitations, this review equips veterinarians, educators, and researchers to harness the power of ChatGPT effectively.
Affiliation(s)
- Candice P. Chu
- Department of Veterinary Pathobiology, College of Veterinary Medicine & Biomedical Sciences, Texas A&M University, College Station, TX, United States
42
Lee TJ, Campbell DJ, Rao AK, Hossain A, Elkattawy O, Radfar N, Lee P, Gardin JM. Evaluating ChatGPT Responses on Atrial Fibrillation for Patient Education. Cureus 2024; 16:e61680. [PMID: 38841294 PMCID: PMC11151148 DOI: 10.7759/cureus.61680] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 06/04/2024] [Indexed: 06/07/2024] Open
Abstract
Background ChatGPT is a language model that has gained widespread popularity for its fine-tuned conversational abilities. However, a known drawback to the artificial intelligence (AI) chatbot is its tendency to confidently present users with inaccurate information. We evaluated the quality of ChatGPT responses to questions pertaining to atrial fibrillation for patient education. Our analysis included the accuracy and estimated grade level of answers and whether references were provided for the answers. Methodology ChatGPT was prompted four times and 16 frequently asked questions on atrial fibrillation from the American Heart Association were asked. Prompts included Form 1 (no prompt), Form 2 (patient-friendly prompt), Form 3 (physician-level prompt), and Form 4 (prompting for statistics/references). Responses were scored as incorrect, partially correct, or correct with references (perfect). Flesch-Kincaid grade-level unique words and response lengths were recorded for answers. Proportions of the responses at differing scores were compared using the chi-square analysis. The relationship between form and grade level was assessed using the analysis of variance. Results Across all forms, scoring frequencies were one (1.6%) incorrect, five (7.8%) partially correct, 55 (85.9%) correct, and three (4.7%) perfect. Proportions of responses that were at least correct did not differ by form (p = 0.350), but perfect responses did (p = 0.001). Form 2 answers had a lower mean grade level (12.80 ± 3.38) than Forms 1 (14.23 ± 2.34), 3 (16.73 ± 2.65), and 4 (14.85 ± 2.76) (p < 0.05). Across all forms, references were provided in only three (4.7%) answers. Notably, when additionally prompted for sources or references, ChatGPT still only provided sources on three responses out of 16 (18.8%). Conclusions ChatGPT holds significant potential for enhancing patient education through accurate, adaptive responses. Its ability to alter response complexity based on user input, combined with high accuracy rates, supports its use as an informational resource in healthcare settings. Future advancements and continuous monitoring of AI capabilities will be crucial in maximizing the benefits while mitigating the risks associated with AI-driven patient education.
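The proportion comparisons described above (for example, "perfect" versus non-perfect responses across the four prompt forms) amount to a chi-square test on a contingency table. The sketch below uses scipy with invented counts; it is not the study's data.

```python
# Hedged sketch: chi-square test of response quality across four prompt forms (invented counts).
from scipy.stats import chi2_contingency

#              Form 1  Form 2  Form 3  Form 4
perfect      = [0,      0,      0,      3]
not_perfect  = [16,     16,     16,     13]

chi2, p, dof, expected = chi2_contingency([perfect, not_perfect])
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.3f}")
```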
Affiliation(s)
- Thomas J Lee
- Department of Medicine, Rutgers University New Jersey Medical School, Newark, USA
| | - Daniel J Campbell
- Otolaryngology-Head and Neck Surgery, Thomas Jefferson University Hospital, Philadelphia, USA
| | - Abhinav K Rao
- Department of Medicine, Trident Medical Center, Charleston, USA
| | - Afif Hossain
- Department of Medicine/Division of Cardiology, Rutgers University New Jersey Medical School, Newark, USA
| | - Omar Elkattawy
- Department of Medicine, Rutgers University New Jersey Medical School, Newark, USA
| | - Navid Radfar
- Department of Medicine, Rutgers University New Jersey Medical School, Newark, USA
| | - Paul Lee
- Department of Medicine, Rutgers University New Jersey Medical School, Newark, USA
| | - Julius M Gardin
- Department of Medicine/Division of Cardiology, Rutgers University New Jersey Medical School, Newark, USA
| |
|
43
|
Hershenhouse JS, Mokhtar D, Eppler MB, Rodler S, Storino Ramacciotti L, Ganjavi C, Hom B, Davis RJ, Tran J, Russo GI, Cocci A, Abreu A, Gill I, Desai M, Cacciamani GE. Accuracy, readability, and understandability of large language models for prostate cancer information to the public. Prostate Cancer Prostatic Dis 2024:10.1038/s41391-024-00826-y. [PMID: 38744934 DOI: 10.1038/s41391-024-00826-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/29/2023] [Revised: 03/14/2024] [Accepted: 03/26/2024] [Indexed: 05/16/2024]
Abstract
BACKGROUND Generative Pretrained Transformer (GPT) chatbots have gained popularity since the public release of ChatGPT. Studies have evaluated the ability of different GPT models to provide information about medical conditions. To date, no study has assessed the quality of ChatGPT outputs to prostate cancer-related questions from both the physician and public perspectives while optimizing outputs for patient consumption. METHODS Nine prostate cancer-related questions, identified through Google Trends (Global), were categorized into diagnosis, treatment, and postoperative follow-up. These questions were processed using ChatGPT 3.5, and the responses were recorded. Subsequently, these responses were re-inputted into ChatGPT to create simplified summaries understandable at a sixth-grade level. Readability of both the original ChatGPT responses and the layperson summaries was evaluated using validated readability tools. A survey was conducted among urology providers (urologists and urologists in training) to rate the original ChatGPT responses for accuracy, completeness, and clarity using a 5-point Likert scale. Furthermore, two independent reviewers evaluated the layperson summaries on a correctness trifecta: accuracy, completeness, and decision-making sufficiency. Public assessment of the simplified summaries' clarity and understandability was carried out through Amazon Mechanical Turk (MTurk), where participants rated clarity and demonstrated their understanding through a multiple-choice question. RESULTS GPT-generated output was deemed correct by 71.7% to 94.3% of raters (36 urologists, 17 urology residents) across the 9 scenarios. GPT-generated simplified layperson summaries of this output were rated as accurate in 8 of 9 (88.9%) scenarios and sufficient for a patient to make a decision in 8 of 9 (88.9%) scenarios. Mean readability of the layperson summaries was higher than that of the original GPT outputs (original ChatGPT vs. simplified ChatGPT, mean (SD): Flesch Reading Ease 36.5 (9.1) vs. 70.2 (11.2), p < 0.0001; Gunning Fog 15.8 (1.7) vs. 9.5 (2.0), p < 0.0001; Flesch-Kincaid Grade Level 12.8 (1.2) vs. 7.4 (1.7), p < 0.0001; Coleman-Liau 13.7 (2.1) vs. 8.6 (2.4), p = 0.0002; SMOG Index 11.8 (1.2) vs. 6.7 (1.8), p < 0.0001; Automated Readability Index 13.1 (1.4) vs. 7.5 (2.1), p < 0.0001). MTurk workers (n = 514) rated the layperson summaries as correct (89.5-95.7%) and correctly understood the content (63.0-87.4%). CONCLUSION GPT shows promise for providing accurate patient education on prostate cancer-related content, but the technology is not designed for delivering information to patients. Prompting the model to respond with accuracy, completeness, clarity, and readability may enhance its utility in GPT-powered medical chatbots.
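The readability battery reported above can be reproduced with an off-the-shelf package; below is a minimal sketch assuming textstat, applied to a placeholder original response and its simplified layperson rewrite (neither text comes from the study).

```python
# Sketch: the readability metrics reported above, computed with textstat on a
# placeholder original response and its simplified rewrite.
import textstat

original = ("Radical prostatectomy entails surgical removal of the prostate gland and "
            "seminal vesicles, with or without pelvic lymph node dissection.")
simplified = ("Surgery for prostate cancer removes the prostate. "
              "Sometimes nearby lymph nodes are removed too.")

metrics = {
    "Flesch Reading Ease": textstat.flesch_reading_ease,
    "Gunning Fog": textstat.gunning_fog,
    "Flesch-Kincaid Grade": textstat.flesch_kincaid_grade,
    "Coleman-Liau": textstat.coleman_liau_index,
    "SMOG Index": textstat.smog_index,  # SMOG is intended for texts of 30+ sentences
    "Automated Readability Index": textstat.automated_readability_index,
}

for name, fn in metrics.items():
    print(f"{name}: original = {fn(original):.1f}, simplified = {fn(simplified):.1f}")
```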
Affiliation(s)
- Jacob S Hershenhouse
- USC Institute of Urology and Catherine and Joseph Aresty Department of Urology, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA
- Artificial Intelligence Center, USC Institute of Urology, University of Southern California, Los Angeles, CA, USA
| | - Daniel Mokhtar
- USC Institute of Urology and Catherine and Joseph Aresty Department of Urology, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA
- Artificial Intelligence Center, USC Institute of Urology, University of Southern California, Los Angeles, CA, USA
| | - Michael B Eppler
- USC Institute of Urology and Catherine and Joseph Aresty Department of Urology, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA
- Artificial Intelligence Center, USC Institute of Urology, University of Southern California, Los Angeles, CA, USA
| | - Severin Rodler
- USC Institute of Urology and Catherine and Joseph Aresty Department of Urology, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA
- Artificial Intelligence Center, USC Institute of Urology, University of Southern California, Los Angeles, CA, USA
| | - Lorenzo Storino Ramacciotti
- USC Institute of Urology and Catherine and Joseph Aresty Department of Urology, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA
- Artificial Intelligence Center, USC Institute of Urology, University of Southern California, Los Angeles, CA, USA
| | - Conner Ganjavi
- USC Institute of Urology and Catherine and Joseph Aresty Department of Urology, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA
- Artificial Intelligence Center, USC Institute of Urology, University of Southern California, Los Angeles, CA, USA
| | - Brian Hom
- USC Institute of Urology and Catherine and Joseph Aresty Department of Urology, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA
- Artificial Intelligence Center, USC Institute of Urology, University of Southern California, Los Angeles, CA, USA
| | - Ryan J Davis
- USC Institute of Urology and Catherine and Joseph Aresty Department of Urology, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA
- Artificial Intelligence Center, USC Institute of Urology, University of Southern California, Los Angeles, CA, USA
| | - John Tran
- USC Institute of Urology and Catherine and Joseph Aresty Department of Urology, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA
- Artificial Intelligence Center, USC Institute of Urology, University of Southern California, Los Angeles, CA, USA
| | | | - Andrea Cocci
- Urology Section, University of Florence, Florence, Italy
| | - Andre Abreu
- USC Institute of Urology and Catherine and Joseph Aresty Department of Urology, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA
- Artificial Intelligence Center, USC Institute of Urology, University of Southern California, Los Angeles, CA, USA
| | - Inderbir Gill
- USC Institute of Urology and Catherine and Joseph Aresty Department of Urology, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA
- Artificial Intelligence Center, USC Institute of Urology, University of Southern California, Los Angeles, CA, USA
| | - Mihir Desai
- USC Institute of Urology and Catherine and Joseph Aresty Department of Urology, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA
| | - Giovanni E Cacciamani
- USC Institute of Urology and Catherine and Joseph Aresty Department of Urology, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA.
- Artificial Intelligence Center, USC Institute of Urology, University of Southern California, Los Angeles, CA, USA.
| |
|
44
|
Howard FM, Li A, Riffon MF, Garrett-Mayer E, Pearson AT. Characterizing the Increase in Artificial Intelligence Content Detection in Oncology Scientific Abstracts From 2021 to 2023. JCO Clin Cancer Inform 2024; 8:e2400077. [PMID: 38822755 PMCID: PMC11371107 DOI: 10.1200/cci.24.00077] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/03/2024] [Revised: 04/25/2024] [Accepted: 04/26/2024] [Indexed: 06/03/2024] Open
Abstract
PURPOSE Artificial intelligence (AI) models can generate scientific abstracts that are difficult to distinguish from the work of human authors. The use of AI in scientific writing and the performance of AI detection tools are poorly characterized. METHODS We extracted text from published scientific abstracts from the ASCO 2021-2023 Annual Meetings. Likelihood of AI content was evaluated by three detectors: GPTZero, Originality.ai, and Sapling. Optimal thresholds for AI content detection were selected using 100 abstracts from before 2020 as negative controls and 100 produced by OpenAI's GPT-3 and GPT-4 models as positive controls. Logistic regression was used to evaluate the association of predicted AI content with submission year and abstract characteristics, and adjusted odds ratios (aORs) were computed. RESULTS Fifteen thousand five hundred fifty-three abstracts met inclusion criteria. Across detectors, abstracts submitted in 2023 were significantly more likely to contain AI content than those submitted in 2021 (aORs ranging from 1.79 with Originality to 2.37 with Sapling). Online-only publication and lack of a clinical trial number were consistently associated with AI content. With optimal thresholds, 99.5%, 96%, and 97% of GPT-3/4-generated abstracts were identified by GPTZero, Originality, and Sapling, respectively, and no sampled abstracts from before 2020 were classified as AI generated by the GPTZero and Originality detectors. Correlation between detectors was low to moderate, with Spearman correlation coefficients ranging from 0.14 (Originality and Sapling) to 0.47 (Sapling and GPTZero). CONCLUSION There is an increasing signal of AI content in ASCO abstracts, coinciding with the growing popularity of generative AI models.
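As a sketch of the threshold-selection and regression steps described in the methods, the code below picks a detector cutoff from labeled control abstracts via an ROC curve and fits a logistic regression of flagged status on submission year. It assumes numpy, scikit-learn, and statsmodels, and all detector scores and labels are simulated placeholders, not study data.

```python
# Sketch: choosing a detector threshold from control abstracts and relating
# predicted AI content to submission year. All data below are simulated.
import numpy as np
from sklearn.metrics import roc_curve
import statsmodels.api as sm

rng = np.random.default_rng(0)
# Detector scores: negatives = pre-2020 human abstracts, positives = GPT-generated controls
human_scores = rng.beta(2, 8, 100)
ai_scores = rng.beta(8, 2, 100)
scores = np.concatenate([human_scores, ai_scores])
labels = np.concatenate([np.zeros(100), np.ones(100)])

fpr, tpr, thresholds = roc_curve(labels, scores)
best = thresholds[np.argmax(tpr - fpr)]  # Youden's J as one reasonable cutoff
print("chosen threshold:", round(float(best), 3))

# Logistic regression: flagged-as-AI (0/1) vs. submission year (placeholder data)
years = rng.choice([2021, 2022, 2023], size=500)
flagged = rng.binomial(1, 0.05 + 0.05 * (years - 2021))
X = sm.add_constant(years - 2021)
fit = sm.Logit(flagged, X).fit(disp=0)
print("odds ratio per year:", round(float(np.exp(fit.params[1])), 2))
```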
Affiliation(s)
- Frederick M. Howard
- Section of Hematology/Oncology, Department of Medicine, The University of Chicago, Chicago, IL
| | - Anran Li
- Section of Hematology/Oncology, Department of Medicine, The University of Chicago, Chicago, IL
| | - Mark F. Riffon
- Center for Research and Analytics, American Society of Clinical Oncology, Alexandria, VA
| | | | - Alexander T. Pearson
- Section of Hematology/Oncology, Department of Medicine, The University of Chicago, Chicago, IL
| |
|
45
|
Roberts LW. Addressing the Novel Implications of Generative AI for Academic Publishing, Education, and Research. Acad Med 2024; 99:471-473. [PMID: 38451086 DOI: 10.1097/acm.0000000000005667] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 03/08/2024]
|
46
|
Garcia Valencia OA, Thongprayoon C, Jadlowiec CC, Mao SA, Leeaphorn N, Budhiraja P, Craici IM, Gonzalez Suarez ML, Cheungpasitporn W. AI-driven translations for kidney transplant equity in Hispanic populations. Sci Rep 2024; 14:8511. [PMID: 38609476 PMCID: PMC11014982 DOI: 10.1038/s41598-024-59237-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/24/2024] [Accepted: 04/08/2024] [Indexed: 04/14/2024] Open
Abstract
Health equity and access to Spanish-language kidney transplant information remain substantial challenges facing the Hispanic community. This study evaluated ChatGPT's capabilities in translating 54 English kidney transplant frequently asked questions (FAQs) into Spanish using two versions of the AI model, GPT-3.5 and GPT-4.0. The FAQs included 19 from the Organ Procurement and Transplantation Network (OPTN), 15 from the National Health Service (NHS), and 20 from the National Kidney Foundation (NKF). Two native Spanish-speaking nephrologists, both of Mexican heritage, scored the translations for linguistic accuracy and cultural sensitivity tailored to Hispanic readers using a 1-5 rubric. The inter-rater reliability of the evaluators, measured by Cohen's kappa, was 0.85. Overall linguistic accuracy was 4.89 ± 0.31 for GPT-3.5 versus 4.94 ± 0.23 for GPT-4.0 (nonsignificant, p = 0.23). Both versions scored 4.96 ± 0.19 in cultural sensitivity (p = 1.00). By source, GPT-3.5 linguistic accuracy was 4.84 ± 0.37 (OPTN), 4.93 ± 0.26 (NHS), and 4.90 ± 0.31 (NKF); GPT-4.0 scored 4.95 ± 0.23 (OPTN), 4.93 ± 0.26 (NHS), and 4.95 ± 0.22 (NKF). For cultural sensitivity, GPT-3.5 scored 4.95 ± 0.23 (OPTN), 4.93 ± 0.26 (NHS), and 5.00 ± 0.00 (NKF), while GPT-4.0 scored 5.00 ± 0.00 (OPTN), 5.00 ± 0.00 (NHS), and 4.90 ± 0.31 (NKF). These high linguistic accuracy and cultural sensitivity scores demonstrate that ChatGPT effectively translated the English FAQs into Spanish across all three sources. The findings suggest ChatGPT's potential to promote health equity by improving Spanish-language access to essential kidney transplant information. Additional research should evaluate its medical translation capabilities across diverse contexts and languages. These English-to-Spanish translations may increase access to vital transplant information for underserved Spanish-speaking Hispanic patients.
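The agreement statistic used here is straightforward to compute; below is a minimal sketch assuming scikit-learn, with placeholder rubric scores rather than the study's ratings.

```python
# Sketch: inter-rater reliability for two reviewers scoring translations on a
# 1-5 rubric, using Cohen's kappa. Ratings below are placeholders.
from sklearn.metrics import cohen_kappa_score

rater_1 = [5, 5, 4, 5, 5, 4, 5, 3, 5, 5]
rater_2 = [5, 5, 4, 5, 4, 4, 5, 3, 5, 5]

# Unweighted kappa treats 1-5 as nominal categories; weights="quadratic"
# is a common alternative for ordinal rubric scores.
print("kappa:", round(cohen_kappa_score(rater_1, rater_2), 2))
print("weighted kappa:", round(cohen_kappa_score(rater_1, rater_2, weights="quadratic"), 2))
```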
Affiliation(s)
- Oscar A Garcia Valencia
- Division of Nephrology and Hypertension, Department of Medicine, Mayo Clinic, Rochester, MN, USA
| | - Charat Thongprayoon
- Division of Nephrology and Hypertension, Department of Medicine, Mayo Clinic, Rochester, MN, USA
| | | | - Shennen A Mao
- Division of Transplant Surgery, Department of Transplantation, Mayo Clinic, Jacksonville, FL, USA
| | - Napat Leeaphorn
- Division of Transplant Surgery, Department of Transplantation, Mayo Clinic, Jacksonville, FL, USA
- Department of Transplant, Mayo Clinic, Jacksonville, USA
| | - Pooja Budhiraja
- Division of Transplant Surgery, Mayo Clinic, Phoenix, AZ, USA
| | - Iasmina M Craici
- Division of Nephrology and Hypertension, Department of Medicine, Mayo Clinic, Rochester, MN, USA
| | - Maria L Gonzalez Suarez
- Division of Nephrology and Hypertension, Department of Medicine, Mayo Clinic, Rochester, MN, USA
| | - Wisit Cheungpasitporn
- Division of Nephrology and Hypertension, Department of Medicine, Mayo Clinic, Rochester, MN, USA.
| |
|
47
|
Yuan S, Li F, Browning MHEM, Bardhan M, Zhang K, McAnirlin O, Patwary MM, Reuben A. Leveraging and exercising caution with ChatGPT and other generative artificial intelligence tools in environmental psychology research. Front Psychol 2024; 15:1295275. [PMID: 38650897 PMCID: PMC11033305 DOI: 10.3389/fpsyg.2024.1295275] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/15/2023] [Accepted: 03/01/2024] [Indexed: 04/25/2024] Open
Abstract
Generative Artificial Intelligence (GAI) is an emerging and disruptive technology that has attracted considerable interest from researchers and educators across various disciplines. We discuss the relevance of ChatGPT and other GAI tools to environmental psychology research, along with the concerns they raise. We propose three use categories for GAI tools: integrated and contextualized understanding, practical and flexible implementation, and two-way external communication. These categories are exemplified by topics such as the health benefits of green space, theory building, visual simulation, and identifying practical relevance. However, we also highlight the need to balance productivity gains against ethical issues, as well as the need for ethical guidelines, professional training, and changes to academic performance evaluation systems. We hope this perspective can foster constructive dialogue and responsible practice with GAI tools.
Affiliation(s)
- Shuai Yuan
- Virtual Reality and Nature Lab, Department of Parks, Recreation and Tourism Management, Clemson University, Clemson, SC, United States
| | - Fu Li
- Virtual Reality and Nature Lab, Department of Parks, Recreation and Tourism Management, Clemson University, Clemson, SC, United States
| | - Matthew H. E. M. Browning
- Virtual Reality and Nature Lab, Department of Parks, Recreation and Tourism Management, Clemson University, Clemson, SC, United States
| | - Mondira Bardhan
- Virtual Reality and Nature Lab, Department of Parks, Recreation and Tourism Management, Clemson University, Clemson, SC, United States
| | - Kuiran Zhang
- Virtual Reality and Nature Lab, Department of Parks, Recreation and Tourism Management, Clemson University, Clemson, SC, United States
| | - Olivia McAnirlin
- Virtual Reality and Nature Lab, Department of Parks, Recreation and Tourism Management, Clemson University, Clemson, SC, United States
| | - Muhammad Mainuddin Patwary
- Environment and Sustainability Research Initiative, Khulna, Bangladesh
- Environmental Science Discipline, Life Science School, Khulna University, Khulna, Bangladesh
| | - Aaron Reuben
- Department of Psychology and Neuroscience, Duke University, Durham, NC, United States
| |
|
48
|
Stribling D, Xia Y, Amer MK, Graim KS, Mulligan CJ, Renne R. The model student: GPT-4 performance on graduate biomedical science exams. Sci Rep 2024; 14:5670. [PMID: 38453979 PMCID: PMC10920673 DOI: 10.1038/s41598-024-55568-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/28/2023] [Accepted: 02/25/2024] [Indexed: 03/09/2024] Open
Abstract
The GPT-4 large language model (LLM) and ChatGPT chatbot have emerged as accessible and capable tools for generating English-language text in a variety of formats. GPT-4 has previously performed well when applied to questions from multiple standardized examinations. However, further evaluation of trustworthiness and accuracy of GPT-4 responses across various knowledge domains is essential before its use as a reference resource. Here, we assess GPT-4 performance on nine graduate-level examinations in the biomedical sciences (seven blinded), finding that GPT-4 scores exceed the student average in seven of nine cases and exceed all student scores for four exams. GPT-4 performed very well on fill-in-the-blank, short-answer, and essay questions, and correctly answered several questions on figures sourced from published manuscripts. Conversely, GPT-4 performed poorly on questions with figures containing simulated data and those requiring a hand-drawn answer. Two GPT-4 answer-sets were flagged as plagiarism based on answer similarity and some model responses included detailed hallucinations. In addition to assessing GPT-4 performance, we discuss patterns and limitations in GPT-4 capabilities with the goal of informing design of future academic examinations in the chatbot era.
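The answer-similarity screen that flagged two GPT-4 answer-sets can be approximated with a simple lexical comparison; the sketch below uses scikit-learn's TF-IDF vectorizer and cosine similarity. The answer texts and the 0.85 review threshold are illustrative assumptions, not the procedure used in the study.

```python
# Sketch: flagging highly similar answer-sets by lexical overlap.
# Texts and the 0.85 threshold are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

answers = {
    "student_A": "Histone acetylation loosens chromatin and generally increases transcription.",
    "student_B": "Acetylation of histones loosens chromatin, generally increasing transcription.",
    "gpt4_run1": "Histone acetylation relaxes chromatin structure and promotes gene transcription.",
}

names = list(answers)
tfidf = TfidfVectorizer().fit_transform(answers.values())
sims = cosine_similarity(tfidf)

for i in range(len(names)):
    for j in range(i + 1, len(names)):
        flag = " <-- review" if sims[i, j] > 0.85 else ""
        print(f"{names[i]} vs {names[j]}: {sims[i, j]:.2f}{flag}")
```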
Affiliation(s)
- Daniel Stribling
- Department of Molecular Genetics and Microbiology, University of Florida, Gainesville, FL, 32610, USA.
- UF Genetics Institute, University of Florida, Gainesville, FL, 32610, USA.
- UF Health Cancer Center, University of Florida, Gainesville, FL, 32610, USA.
| | - Yuxing Xia
- Department of Neuroscience, Center for Translational Research in Neurodegenerative Disease, College of Medicine, University of Florida, Gainesville, FL, 32610, USA
- Department of Neurology, UCLA, Los Angeles, CA, 90095, USA
| | - Maha K Amer
- Department of Molecular Genetics and Microbiology, University of Florida, Gainesville, FL, 32610, USA
| | - Kiley S Graim
- Department of Computer and Information Science and Engineering, Herbert Wertheim College of Engineering, University of Florida, Gainesville, FL, 32610, USA
| | - Connie J Mulligan
- UF Genetics Institute, University of Florida, Gainesville, FL, 32610, USA
- Department of Anthropology, University of Florida, Gainesville, FL, 32610, USA
| | - Rolf Renne
- Department of Molecular Genetics and Microbiology, University of Florida, Gainesville, FL, 32610, USA.
- UF Genetics Institute, University of Florida, Gainesville, FL, 32610, USA.
- UF Health Cancer Center, University of Florida, Gainesville, FL, 32610, USA.
| |
|
49
|
Ambrosio L, Schol J, La Pietra VA, Russo F, Vadalà G, Sakai D. Threats and opportunities of using ChatGPT in scientific writing-The risk of getting spine less. JOR Spine 2024; 7:e1296. [PMID: 38222818 PMCID: PMC10782071 DOI: 10.1002/jsp2.1296] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 08/04/2023] [Revised: 09/10/2023] [Accepted: 09/17/2023] [Indexed: 01/16/2024] Open
Abstract
ChatGPT and other AI chatbots are revolutionizing several scientific fields, including medical writing. However, inappropriate use of such advantageous tools can raise numerous methodological and ethical issues.
Affiliation(s)
- Luca Ambrosio
- Operative Research Unit of Orthopaedic and Trauma Surgery, Fondazione Policlinico Universitario Campus Bio-Medico, Rome, Italy
- Research Unit of Orthopaedic and Trauma Surgery, Department of Medicine and Surgery, Università Campus Bio-Medico di Roma, Rome, Italy
- Department of Orthopaedic Surgery, Tokai University School of Medicine, Isehara, Japan
| | - Jordy Schol
- Department of Orthopaedic Surgery, Tokai University School of Medicine, Isehara, Japan
| | | | - Fabrizio Russo
- Operative Research Unit of Orthopaedic and Trauma Surgery, Fondazione Policlinico Universitario Campus Bio-Medico, Rome, Italy
- Research Unit of Orthopaedic and Trauma Surgery, Department of Medicine and Surgery, Università Campus Bio-Medico di Roma, Rome, Italy
| | - Gianluca Vadalà
- Operative Research Unit of Orthopaedic and Trauma Surgery, Fondazione Policlinico Universitario Campus Bio-Medico, Rome, Italy
- Research Unit of Orthopaedic and Trauma Surgery, Department of Medicine and Surgery, Università Campus Bio-Medico di Roma, Rome, Italy
| | - Daisuke Sakai
- Department of Orthopaedic Surgery, Tokai University School of Medicine, Isehara, Japan
| |
|
50
|
Lee GU, Hong DY, Kim SY, Kim JW, Lee YH, Park SO, Lee KR. Comparison of the problem-solving performance of ChatGPT-3.5, ChatGPT-4, Bing Chat, and Bard for the Korean emergency medicine board examination question bank. Medicine (Baltimore) 2024; 103:e37325. [PMID: 38428889 PMCID: PMC10906566 DOI: 10.1097/md.0000000000037325] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 12/03/2023] [Accepted: 01/31/2024] [Indexed: 03/03/2024] Open
Abstract
Large language models (LLMs) have been deployed in diverse fields, and the potential for their application in medicine has been explored through numerous studies. This study aimed to evaluate and compare the performance of ChatGPT-3.5, ChatGPT-4, Bing Chat, and Bard for the Emergency Medicine Board Examination question bank in the Korean language. Of the 2353 questions in the question bank, 150 questions were randomly selected, and 27 containing figures were excluded. Questions that required abilities such as analysis, creative thinking, evaluation, and synthesis were classified as higher-order questions, and those that required only recall, memory, and factual information in response were classified as lower-order questions. The answers and explanations obtained by inputting the 123 questions into the LLMs were analyzed and compared. ChatGPT-4 (75.6%) and Bing Chat (70.7%) showed higher correct response rates than ChatGPT-3.5 (56.9%) and Bard (51.2%). ChatGPT-4 showed the highest correct response rate for the higher-order questions at 76.5%, and Bard and Bing Chat showed the highest rate for the lower-order questions at 71.4%. The appropriateness of the explanation for the answer was significantly higher for ChatGPT-4 and Bing Chat than for ChatGPT-3.5 and Bard (75.6%, 68.3%, 52.8%, and 50.4%, respectively). ChatGPT-4 and Bing Chat outperformed ChatGPT-3.5 and Bard in answering a random selection of Emergency Medicine Board Examination questions in the Korean language.
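To illustrate how the reported accuracy gaps might be tested formally, the sketch below runs a chi-square test on correct/incorrect counts back-calculated from the percentages above (123 scored questions per model, rounded), assuming scipy; the test itself is an illustration, not an analysis performed in the study.

```python
# Sketch: testing whether correct-response rates differ across the four
# chatbots. Counts are back-calculated from the reported percentages
# (123 questions per model) and rounded, so treat them as approximate.
from scipy.stats import chi2_contingency

models = ["ChatGPT-4", "Bing Chat", "ChatGPT-3.5", "Bard"]
correct = [93, 87, 70, 63]
incorrect = [123 - c for c in correct]

table = list(zip(correct, incorrect))
chi2, p, dof, _ = chi2_contingency(table)
print(f"chi-square = {chi2:.1f}, dof = {dof}, p = {p:.4f}")

for m, c in zip(models, correct):
    print(f"{m}: {c}/123 = {c / 123:.1%}")
```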
Affiliation(s)
- Go Un Lee
- Department of Emergency Medicine, Konkuk University Medical Center, Seoul, Republic of Korea
| | - Dae Young Hong
- Department of Emergency Medicine, Konkuk University School of Medicine, Seoul, Republic of Korea
| | - Sin Young Kim
- Department of Emergency Medicine, Konkuk University Medical Center, Seoul, Republic of Korea
| | - Jong Won Kim
- Department of Emergency Medicine, Konkuk University School of Medicine, Seoul, Republic of Korea
| | - Young Hwan Lee
- Department of Emergency Medicine, Konkuk University School of Medicine, Seoul, Republic of Korea
| | - Sang O Park
- Department of Emergency Medicine, Konkuk University School of Medicine, Seoul, Republic of Korea
| | - Kyeong Ryong Lee
- Department of Emergency Medicine, Konkuk University School of Medicine, Seoul, Republic of Korea
| |
|