1. Ciudad-Fernández V, von Hammerstein C, Billieux J. People are not becoming "AIholic": Questioning the "ChatGPT addiction" construct. Addict Behav 2025;166:108325. PMID: 40073725. DOI: 10.1016/j.addbeh.2025.108325.
Abstract
Generative artificial intelligence (AI) chatbots such as ChatGPT have rapidly gained popularity in many daily life spheres, even sparking scholarly debate about a potential "ChatGPT addiction." Throughout history, new technologies have repeatedly been associated with widespread concerns and "moral panics," especially when their adoption is sudden and involves significant changes in daily functioning. It is thus no surprise that researchers have examined whether intensive use of ChatGPT can be considered an addictive behavior. At least four scales measuring ChatGPT addiction have been developed so far, all framed after substance use disorder criteria. Drawing parallels with previous cases of pathologizing everyday behaviors, we caution against labeling and defining intensive or habitual chatbot use as addictive behavior. To label a behavior as addictive, there must be convincing evidence of negative consequences, impaired control, psychological distress, and functional impairment. However, the existing research on problematic use of ChatGPT or other conversational AI bots fails to provide such robust scientific evidence. Caution is thus warranted to avoid (over)pathologization, inappropriate or unnecessary treatments, and excessive regulation of tools that have many benefits when used in a mindful and regulated manner.
Affiliation(s)
- Víctor Ciudad-Fernández
- Department of Personality, Evaluation, and Psychological Treatments, University of Valencia, Valencia, Spain; Polibienestar Institute, University of Valencia, Valencia, Spain.
- Cora von Hammerstein
- Department of Psychiatry and Addiction Medicine, Fernand Widal Hospital APHP, Paris, France; Paris Cité University, INSERM, therapeutic optimization in neuropharmacology OPTEN U1144, Paris, France.
- Joël Billieux
- Institute of Psychology, University of Lausanne, Lausanne, Switzerland; Center for Excessive Gambling, Addiction Medicine, Lausanne University Hospital (CHUV), Lausanne, Switzerland.
2. Shirani M. Comparing the performance of ChatGPT 4o, DeepSeek R1, and Gemini 2 Pro in answering fixed prosthodontics questions over time. J Prosthet Dent 2025:S0022-3913(25)00400-7. PMID: 40410043. DOI: 10.1016/j.prosdent.2025.04.038.
Abstract
STATEMENT OF PROBLEM The accuracy of DeepSeek and the latest versions of ChatGPT and Gemini in responding to prosthodontics questions needs to be evaluated. Additionally, the extent to which the performance of these chatbots changes through user interactions remains unexplored. PURPOSE The purpose of this longitudinal repeated-measures experimental study was to compare the performance of ChatGPT (4o), DeepSeek (R1), and Gemini (2 Pro) in answering multiple-choice (MC) and short-answer (SA) fixed prosthodontics questions over 4 consecutive weeks after exposure to correct responses. MATERIAL AND METHODS A total of 40 questions (20 MC and 20 SA) were developed based on the sixth edition of Contemporary Fixed Prosthodontics. Following a standardized protocol, these questions were posed to ChatGPT, DeepSeek, and Gemini on 4 consecutive Saturdays using 10 independent accounts per chatbot. After each session, correct answers were provided to the chatbots, and, before the next session, their memory and history were cleared. Responses were scored as correct (1) or incorrect (0) for MC questions and correct (2), partially correct (1), or incorrect (0) for SA questions. Weighted accuracy was calculated accordingly. The Kendall W coefficient was used to assess agreement among the 10 accounts per chatbot. The effects of chatbot type, time (week), and their interaction on performance were analyzed using generalized estimating equations (GEEs), followed by pairwise comparisons using the Mann-Whitney U test and Wilcoxon signed-rank test with Bonferroni adjustments for multiple comparisons (α=.05). RESULTS All chatbots showed significant reproducibility, with Gemini exhibiting the highest repeatability for SA questions, followed by ChatGPT for MC questions. Accuracy ranged between 43% and 71%. ChatGPT and DeepSeek demonstrated significantly better performance in MC questions compared with Gemini (P<.017). However, in the third week, Gemini outperformed DeepSeek in SA questions (P=.007). Over time, Gemini showed continuous improvement in SA questions, whereas DeepSeek exhibited a performance surge in the fourth week. ChatGPT's performance remained stable throughout the study period. CONCLUSIONS The overall accuracy of the studied chatbots in answering MC and SA prosthodontics questions was not satisfactory. Among them, ChatGPT was the most reliable for MC questions, while ChatGPT and Gemini performed best for SA questions. Gemini (for SA questions) and DeepSeek (for MC and SA questions) demonstrated improvement after exposure to correct responses.
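For readers replicating this design, the weighted-accuracy and agreement computations described in the methods can be sketched in a few lines of Python. This is an illustrative sketch with simulated scores, not the study's data, and the tie correction for Kendall's W is omitted:

```python
import numpy as np
from scipy.stats import rankdata

def weighted_accuracy(scores, max_score):
    """Share of points earned out of the maximum attainable."""
    scores = np.asarray(scores, dtype=float)
    return scores.sum() / (max_score * scores.size)

def kendalls_w(ratings):
    """Kendall's W for an (m raters x n items) matrix; tie correction omitted."""
    ratings = np.asarray(ratings, dtype=float)
    m, n = ratings.shape
    ranks = np.vstack([rankdata(row) for row in ratings])  # rank items within each rater
    rank_sums = ranks.sum(axis=0)
    s = ((rank_sums - rank_sums.mean()) ** 2).sum()
    return 12.0 * s / (m**2 * (n**3 - n))

# Simulated scores: 10 accounts x 20 SA questions on the 0/1/2 scale
rng = np.random.default_rng(0)
sa = rng.integers(0, 3, size=(10, 20))
print(weighted_accuracy(sa, max_score=2), kendalls_w(sa))
```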
Affiliation(s)
- Mohammadjavad Shirani
- Assistant Professor, Department of Restorative Dentistry, Maurice H. Kornberg School of Dentistry, Temple University, Philadelphia, PA, United States.
3. Chokkakula S, Chong S, Yang B, Jiang H, Yu J, Han R, Attitalla IH, Yin C, Zhang S. Quantum leap in medical mentorship: exploring ChatGPT's transition from textbooks to terabytes. Front Med (Lausanne) 2025;12:1517981. PMID: 40375935; PMCID: PMC12079582. DOI: 10.3389/fmed.2025.1517981.
Abstract
ChatGPT, an advanced AI language model, presents a transformative opportunity in several fields, including medical education. This article examines the integration of ChatGPT into healthcare learning environments, exploring its potential to revolutionize knowledge acquisition, personalize education, support curriculum development, and enhance clinical reasoning. The AI's ability to swiftly access and synthesize medical information across various specialties offers significant value to students and professionals alike. It provides rapid answers to queries on medical theories, treatment guidelines, and diagnostic methods, potentially accelerating the learning curve. The paper emphasizes the necessity of verifying ChatGPT's outputs against authoritative medical sources. A key advantage highlighted is the AI's capacity to tailor learning experiences by assessing individual needs, accommodating diverse learning styles, and offering personalized feedback. The article also considers ChatGPT's role in shaping curricula and assessment techniques, suggesting that educators may need to adapt their methods to incorporate AI-driven learning tools. Additionally, it explores how ChatGPT could bolster clinical problem-solving through AI-powered simulations, fostering critical thinking and diagnostic acumen among students. While recognizing ChatGPT's transformative potential in medical education, the article stresses the importance of thoughtful implementation, continuous validation, and the establishment of protocols to ensure its responsible and effective application in healthcare education settings.
Affiliation(s)
- Santosh Chokkakula
- Department of Microbiology, Chungbuk National University College of Medicine and Medical Research Institute, Cheongju, Chungbuk, Republic of Korea
- Siomui Chong
- Department of Dermatology, The University of Hong Kong-Shenzhen Hospital, Shenzhen, China
- Department of Dermatology, The First Affiliated Hospital of Jinan University and Jinan University Institute of Dermatology, Guangzhou, Guangdong, China
- Institute of Collaborative Innovation, University of Macau, Macao SAR, China
- Bing Yang
- Department of Cell Biology, College of Basic Medical Sciences, Tianjin Medical University, Tianjin, China
- Department of Public Health, International School, Krirk University, Bangkok, Thailand
- Hong Jiang
- Statistical Office, Department of Operations, Zhuhai People's Hospital, Zhuhai Clinical Medical College of Jinan University, Zhuhai, China
- Juan Yu
- Department of Radiology, The First Affiliated Hospital of Shenzhen University, Health Science Center, Shenzhen Second People's Hospital, Shenzhen, China
- Ruiqin Han
- State Key Laboratory of Common Mechanism Research for Major Diseases, Institute of Basic Medical Sciences, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
- Idress Hamad Attitalla
- Department of Microbiology, Faculty of Science, Omar Al-Mukhtar University, AL-Bayda, Libya
- Chengliang Yin
- Medical Innovation Research Department, Chinese PLA General Hospital, Beijing, China
- Shuyao Zhang
- Department of Pharmacy, Guangzhou Red Cross Hospital of Jinan University, Guangzhou, China
4. Laurenzi M, Raffone A, Gallagher S, Chiarella SG. A multidimensional approach to the self in non-human animals through the Pattern Theory of Self. Front Psychol 2025;16:1561420. PMID: 40271366; PMCID: PMC12014599. DOI: 10.3389/fpsyg.2025.1561420.
Abstract
In the last decades, research on animal consciousness has advanced significantly, fueled by interdisciplinary contributions. However, a critical dimension of animal experience remains underexplored: the self. While traditionally linked to human studies, research focused on the self in animals has often been framed dichotomously, distinguishing low-level, bodily, and affective aspects from high-level, cognitive, and conceptual dimensions. Emerging evidence suggests a broader spectrum of self-related features across species, yet current theoretical approaches often reduce the self to a derivative aspect of consciousness or prioritize narrow high-level dimensions, such as self-recognition or metacognition. To address this gap, we propose an integrated framework grounded in the Pattern Theory of Self (PTS). PTS conceptualizes the self as a dynamic, multidimensional construct arising from a matrix of dimensions, ranging from bodily and affective to intersubjective and normative aspects. We propose adopting this multidimensional perspective for the study of the self in animals, by emphasizing the graded nature of the self within each dimension and the non-hierarchical organization across dimensions. In this sense, PTS may accommodate both inter- and intra-species variability, enabling researchers to investigate the self across diverse organisms without relying on anthropocentric biases. We propose that, by integrating this framework with insights from comparative psychology, neuroscience, and ethology, the application of PTS to animals can show how the self emerges in varying degrees and forms, shaped by ecological niches and adaptive demands.
Affiliation(s)
- Matteo Laurenzi
- Department of Psychology, Sapienza University of Rome, Rome, Italy
- Antonino Raffone
- Department of Psychology, Sapienza University of Rome, Rome, Italy
- Shaun Gallagher
- Department of Philosophy, University of Memphis, Memphis, TN, United States
- School of Liberal Arts (SOLA), University of Wollongong, Wollongong, NSW, Australia
- Salvatore G. Chiarella
- School of Liberal Arts (SOLA), University of Wollongong, Wollongong, NSW, Australia
- International School for Advanced Studies (SISSA), Trieste, Italy
5. Niriella MA, Premaratna P, Senanayake M, Kodisinghe S, Dassanayake U, Dassanayake A, Ediriweera DS, de Silva HJ. The reliability of freely accessible, baseline, general-purpose large language model generated patient information for frequently asked questions on liver disease: a preliminary cross-sectional study. Expert Rev Gastroenterol Hepatol 2025;19:437-442. PMID: 39985424. DOI: 10.1080/17474124.2025.2471874.
Abstract
BACKGROUND We assessed the use of large language models (LLMs) like ChatGPT-3.5 and Gemini against human experts as sources of patient information. RESEARCH DESIGN AND METHODS We compared the accuracy, completeness and quality of freely accessible, baseline, general-purpose LLM-generated responses to 20 frequently asked questions (FAQs) on liver disease, with those from two gastroenterologists, using the Kruskal-Wallis test. Three independent gastroenterologists blindly rated each response. RESULTS The expert and AI-generated responses displayed high mean scores across all domains, with no statistical difference between the groups for accuracy [H(2) = 0.421, p = 0.811], completeness [H(2) = 3.146, p = 0.207], or quality [H(2) = 3.350, p = 0.187]. We found no statistical difference between rank totals in accuracy [H(2) = 5.559, p = 0.062], completeness [H(2) = 0.104, p = 0.949], or quality [H(2) = 0.420, p = 0.810] between the three raters (R1, R2, R3). CONCLUSION Our findings outline the potential of freely accessible, baseline, general-purpose LLMs in providing reliable answers to FAQs on liver disease.
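For readers unfamiliar with the Kruskal-Wallis test used here, a minimal sketch with hypothetical ratings (not the study's data):

```python
from scipy.stats import kruskal

# Hypothetical 1-5 ratings of the same 20 FAQ answers from three sources
expert_1 = [5, 4, 5, 5, 4, 5, 4, 5, 5, 4, 5, 5, 4, 5, 4, 5, 5, 4, 5, 5]
expert_2 = [4, 5, 5, 4, 5, 5, 5, 4, 5, 5, 4, 5, 5, 4, 5, 5, 4, 5, 4, 5]
llm      = [5, 4, 4, 5, 5, 4, 5, 5, 4, 5, 5, 4, 5, 5, 4, 4, 5, 5, 5, 4]

h, p = kruskal(expert_1, expert_2, llm)
print(f"H(2) = {h:.3f}, p = {p:.3f}")  # large p -> no detectable group difference
```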
6. Chiu EKY, Chung TWH. Protocol for human evaluation of generative artificial intelligence chatbots in clinical consultations. PLoS One 2025;20:e0300487. PMID: 40106443; PMCID: PMC11922213. DOI: 10.1371/journal.pone.0300487.
Abstract
BACKGROUND Generative artificial intelligence (GenAI) has the potential to revolutionise healthcare delivery. The nuances of real-life clinical practice and complex clinical environments demand a rigorous, evidence-based approach to ensure safe and effective deployment of AI. METHODS We present a protocol for the systematic evaluation of large language models (LLMs) as GenAI chatbots within the context of clinical microbiology and infectious diseases clinical consultations. We aim to critically assess recommendations produced by four leading GenAI models, including Claude 2, Gemini Pro, GPT-4.0, and a GPT-4.0-based custom AI chatbot. DISCUSSION A standardised, healthcare-specific, universal prompt template is developed to elicit clinically impactful AI responses. Generated responses will be graded by two panels of practicing clinicians, encompassing a wide spectrum of domain expertise in clinical microbiology and virology, as well as infectious diseases. Evaluations will be performed using a 5-point Likert scale across four clinical domains: factual consistency, comprehensiveness, coherence, and medical harmfulness. Our study will offer insights into the feasibility, limitations, and boundaries of GenAI in clinical consultations, providing guidance for future research and clinical implementation. Ethical guidelines and safety guardrails should be developed to uphold patient safety and clinical standards.
Affiliation(s)
- Edwin Kwan-Yeung Chiu
- Department of Microbiology, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong, China
- Tom Wai-Hin Chung
- Department of Microbiology, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong, China
7. Onciul R, Tataru CI, Dumitru AV, Crivoi C, Serban M, Covache-Busuioc RA, Radoi MP, Toader C. Artificial Intelligence and Neuroscience: Transformative Synergies in Brain Research and Clinical Applications. J Clin Med 2025;14:550. PMID: 39860555; PMCID: PMC11766073. DOI: 10.3390/jcm14020550.
Abstract
The convergence of Artificial Intelligence (AI) and neuroscience is redefining our understanding of the brain, unlocking new possibilities in research, diagnosis, and therapy. This review explores how AI's cutting-edge algorithms-ranging from deep learning to neuromorphic computing-are revolutionizing neuroscience by enabling the analysis of complex neural datasets, from neuroimaging and electrophysiology to genomic profiling. These advancements are transforming the early detection of neurological disorders, enhancing brain-computer interfaces, and driving personalized medicine, paving the way for more precise and adaptive treatments. Beyond applications, neuroscience itself has inspired AI innovations, with neural architectures and brain-like processes shaping advances in learning algorithms and explainable models. This bidirectional exchange has fueled breakthroughs such as dynamic connectivity mapping, real-time neural decoding, and closed-loop brain-computer systems that adaptively respond to neural states. However, challenges persist, including issues of data integration, ethical considerations, and the "black-box" nature of many AI systems, underscoring the need for transparent, equitable, and interdisciplinary approaches. By synthesizing the latest breakthroughs and identifying future opportunities, this review charts a path forward for the integration of AI and neuroscience. From harnessing multimodal data to enabling cognitive augmentation, the fusion of these fields is not just transforming brain science, it is reimagining human potential. This partnership promises a future where the mysteries of the brain are unlocked, offering unprecedented advancements in healthcare, technology, and beyond.
Affiliation(s)
- Razvan Onciul
- Department of Neurosurgery, “Carol Davila” University of Medicine and Pharmacy, 020021 Bucharest, Romania
- Neurosurgery Department, Emergency University Hospital, 050098 Bucharest, Romania
- Catalina-Ioana Tataru
- Clinical Department of Ophthalmology, “Carol Davila” University of Medicine and Pharmacy, 020021 Bucharest, Romania
- Department of Ophthalmology, Clinical Hospital for Ophthalmological Emergencies, 010464 Bucharest, Romania
- Adrian Vasile Dumitru
- Department of Neurosurgery, “Carol Davila” University of Medicine and Pharmacy, 020021 Bucharest, Romania
- Department of Morphopathology, “Carol Davila” University of Medicine and Pharmacy, 020021 Bucharest, Romania
- Emergency University Hospital, 050098 Bucharest, Romania
- Carla Crivoi
- Department of Computer Science, Faculty of Mathematics and Computer Science, University of Bucharest, 010014 Bucharest, Romania
- Matei Serban
- Department of Neurosurgery, “Carol Davila” University of Medicine and Pharmacy, 020021 Bucharest, Romania
- Department of Vascular Neurosurgery, National Institute of Neurovascular Disease, 077160 Bucharest, Romania
- Puls Med Association, 051885 Bucharest, Romania
- Razvan-Adrian Covache-Busuioc
- Department of Neurosurgery, “Carol Davila” University of Medicine and Pharmacy, 020021 Bucharest, Romania
- Department of Vascular Neurosurgery, National Institute of Neurovascular Disease, 077160 Bucharest, Romania
- Puls Med Association, 051885 Bucharest, Romania
- Mugurel Petrinel Radoi
- Department of Neurosurgery, “Carol Davila” University of Medicine and Pharmacy, 020021 Bucharest, Romania
- Department of Vascular Neurosurgery, National Institute of Neurovascular Disease, 077160 Bucharest, Romania
- Corneliu Toader
- Department of Neurosurgery, “Carol Davila” University of Medicine and Pharmacy, 020021 Bucharest, Romania
- Department of Vascular Neurosurgery, National Institute of Neurovascular Disease, 077160 Bucharest, Romania
8. Meyer A, Soleman A, Riese J, Streichert T. Comparison of ChatGPT, Gemini, and Le Chat with physician interpretations of medical laboratory questions from an online health forum. Clin Chem Lab Med 2024;62:2425-2434. PMID: 38804035. DOI: 10.1515/cclm-2024-0246.
Abstract
OBJECTIVES Laboratory medical reports are often not intuitively comprehensible to non-medical professionals. Given their recent advancements, easier accessibility and remarkable performance on medical licensing exams, patients are therefore likely to turn to artificial intelligence-based chatbots to understand their laboratory results. However, empirical studies assessing the efficacy of these chatbots in responding to real-life patient queries regarding laboratory medicine are scarce. METHODS Thus, this investigation included 100 patient inquiries from an online health forum, specifically addressing Complete Blood Count interpretation. The aim was to evaluate the proficiency of three artificial intelligence-based chatbots (ChatGPT, Gemini and Le Chat) against the online responses from certified physicians. RESULTS The findings revealed that the chatbots' interpretations of laboratory results were inferior to those from online medical professionals. While the chatbots exhibited a higher degree of empathetic communication, they frequently produced erroneous or overly generalized responses to complex patient questions. The appropriateness of chatbot responses ranged from 51% to 64%, with 22% to 33% of responses overestimating patient conditions. A notable positive aspect was the chatbots' consistent inclusion of disclaimers regarding their non-medical nature and recommendations to seek professional medical advice. CONCLUSIONS The chatbots' interpretations of laboratory results from real patient queries highlight a dangerous dichotomy - a perceived trustworthiness potentially obscuring factual inaccuracies. Given the growing inclination towards self-diagnosis using AI platforms, further research and improvement of these chatbots are imperative to increase patients' awareness and avoid future burdens on the healthcare system.
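The appropriateness figures above are proportions out of 100 queries per chatbot; a quick sketch of how such proportions and their uncertainty can be summarized (the counts are hypothetical, chosen only to match the reported range):

```python
from statsmodels.stats.proportion import proportion_confint

N = 100  # patient inquiries per chatbot
appropriate = {"Chatbot A": 64, "Chatbot B": 58, "Chatbot C": 51}  # hypothetical counts

for bot, k in appropriate.items():
    lo, hi = proportion_confint(k, N, alpha=0.05, method="wilson")
    print(f"{bot}: {k/N:.0%} appropriate (95% Wilson CI {lo:.0%}-{hi:.0%})")
```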
Affiliation(s)
- Annika Meyer
- Institute of Clinical Chemistry, Faculty of Medicine and University Hospital Cologne, Cologne, Germany
- Ari Soleman
- Faculty of Medicine and University Hospital Cologne, Cologne, Germany
- Janik Riese
- Institute of Pathology, Faculty of Medicine, RWTH Aachen University, Aachen, Germany
- Thomas Streichert
- Institute of Clinical Chemistry, Faculty of Medicine and University Hospital Cologne, Cologne, Germany
9. Chatzopoulos GS, Koidou VP, Tsalikis L, Kaklamanos EG. Large language models in periodontology: Assessing their performance in clinically relevant questions. J Prosthet Dent 2024:S0022-3913(24)00714-5. PMID: 39562221. DOI: 10.1016/j.prosdent.2024.10.020.
Abstract
STATEMENT OF PROBLEM Although the use of artificial intelligence (AI) seems promising and may assist dentists in clinical practice, the consequences of inaccurate or even harmful responses are paramount. Research is required to examine whether large language models (LLMs) can be used in accessing periodontal content reliably. PURPOSE The purpose of this study was to evaluate and compare the evidence-based potential of answers provided by 4 LLMs to common clinical questions in the field of periodontology. MATERIAL AND METHODS A total of 10 open-ended questions pertinent to periodontology were posed to 4 distinct LLMs: ChatGPT model GPT 4.0, Google Gemini, Google Gemini Advanced, and Microsoft Copilot. The answers to each question were evaluated independently by 2 periodontists against robust scientific evidence based on a predefined rubric assessing the comprehensiveness, scientific accuracy, clarity, and relevance. Each response received a score ranging from 0 (minimum) to 10 (maximum). After a period of 2 weeks from initial evaluation, the answers were re-graded independently to gauge intra-evaluator reliability. Inter-evaluator reliability was assessed using correlation tests, while Cronbach alpha and interclass correlation coefficient were used to measure overall reliability. The Kruskal-Wallis test was employed to compare the scores given by different LLMs. RESULTS The scores provided by the 2 evaluators for both evaluations were statistically similar (P values ranging from .083 to >.999), therefore an average score was calculated for each LLM. Both evaluators gave the highest scores to the answers generated by ChatGPT 4.0, while Google Gemini had the lowest scores. ChatGPT 4.0 received the highest average score, and significant differences were detected between ChatGPT 4.0 and Google Gemini (P=.042). ChatGPT 4.0 answers were found to be highly comprehensive, with scientific accuracy, clarity, and relevance. CONCLUSIONS Professionals need to be aware of the limitations of LLMs when utilizing them. These models must not replace dental professionals, as improper use may negatively impact patient care. ChatGPT 4.0, Google Gemini, Google Gemini Advanced, and Microsoft Copilot performed relatively well, with ChatGPT 4.0 demonstrating the highest performance.
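The overall-reliability statistic named in the methods, Cronbach alpha, can be computed directly from the score matrix; a minimal sketch with hypothetical rubric scores:

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for an (answers x raters) score matrix."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                          # number of raters ("items")
    item_var = scores.var(axis=0, ddof=1).sum()  # sum of per-rater variances
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of the summed scores
    return k / (k - 1) * (1 - item_var / total_var)

# Hypothetical 0-10 rubric scores from 2 evaluators for 10 answers
scores = [[9, 8], [7, 7], [8, 9], [6, 5], [9, 9],
          [5, 6], [8, 8], [7, 6], [9, 8], [6, 7]]
print(f"alpha = {cronbach_alpha(scores):.2f}")
```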
Affiliation(s)
- Georgios S Chatzopoulos
- PhD candidate, Department of Preventive Dentistry, Periodontology and Implant Biology, School of Dentistry, Aristotle University of Thessaloniki, Thessaloniki, Greece; and Visiting Research Assistant Professor, Division of Periodontology, Department of Developmental and Surgical Sciences, School of Dentistry, University of Minnesota, Minneapolis, Minn.
- Vasiliki P Koidou
- Research Assistant, Centre for Oral Immunobiology and Regenerative Medicine and Centre for Oral Clinical Research, Institute of Dentistry, Queen Mary University London (QMUL), London, England, UK
- Lazaros Tsalikis
- Professor, Department of Preventive Dentistry, Periodontology and Implant Biology, School of Dentistry, Aristotle University of Thessaloniki, Thessaloniki, Greece
- Eleftherios G Kaklamanos
- Associate Professor, Department of Preventive Dentistry, Periodontology and Implant Biology, School of Dentistry, Aristotle University of Thessaloniki, Thessaloniki, Greece; Associate Professor, School of Dentistry, European University Cyprus, Nicosia, Cyprus; and Adjunct Associate Professor, Hamdan bin Mohammed College of Dental Medicine, Mohammed bin Rashid University of Medicine and Health Sciences (MBRU), Dubai, United Arab Emirates
10. Li Y, Peng X, Li J, Zuo X, Peng S, Pei D, Tao C, Xu H, Hong N. Relation extraction using large language models: a case study on acupuncture point locations. J Am Med Inform Assoc 2024;31:2622-2631. PMID: 39208311; PMCID: PMC11491641. DOI: 10.1093/jamia/ocae233.
Abstract
OBJECTIVE In acupuncture therapy, the accurate location of acupoints is essential for its effectiveness. The advanced language understanding capabilities of large language models (LLMs) like Generative Pre-trained Transformers (GPTs) and Llama present a significant opportunity for extracting relations related to acupoint locations from textual knowledge sources. This study aims to explore the performance of LLMs in extracting acupoint-related location relations and assess the impact of fine-tuning on GPT's performance. MATERIALS AND METHODS We utilized the World Health Organization Standard Acupuncture Point Locations in the Western Pacific Region (WHO Standard) as our corpus, which consists of descriptions of 361 acupoints. Five types of relations ("direction_of", "distance_of", "part_of", "near_acupoint", and "located_near") (n = 3174) between acupoints were annotated. Four models were compared: pre-trained GPT-3.5, fine-tuned GPT-3.5, pre-trained GPT-4, as well as pretrained Llama 3. Performance metrics included micro-average exact match precision, recall, and F1 scores. RESULTS Our results demonstrate that fine-tuned GPT-3.5 consistently outperformed other models in F1 scores across all relation types. Overall, it achieved the highest micro-average F1 score of 0.92. DISCUSSION The superior performance of the fine-tuned GPT-3.5 model, as shown by its F1 scores, underscores the importance of domain-specific fine-tuning in enhancing relation extraction capabilities for acupuncture-related tasks. In light of the findings from this study, it offers valuable insights into leveraging LLMs for developing clinical decision support and creating educational modules in acupuncture. CONCLUSION This study underscores the effectiveness of LLMs like GPT and Llama in extracting relations related to acupoint locations, with implications for accurately modeling acupuncture knowledge and promoting standard implementation in acupuncture training and practice. The findings also contribute to advancing informatics applications in traditional and complementary medicine, showcasing the potential of LLMs in natural language processing.
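Micro-averaged exact-match precision, recall, and F1 over extracted relation triples reduce to set operations; an illustrative sketch (the triples are invented for illustration, not taken from the WHO Standard corpus):

```python
def micro_prf(gold, pred):
    """Micro-averaged exact-match precision, recall, and F1 over relation triples."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Invented (head, relation, tail) triples for one acupoint description
gold = {("LI5", "direction_of", "radial side"), ("LI5", "part_of", "wrist"),
        ("LI5", "near_acupoint", "LI4")}
pred = {("LI5", "direction_of", "radial side"), ("LI5", "part_of", "wrist"),
        ("LI5", "located_near", "tendon")}
print(micro_prf(gold, pred))  # approximately (0.667, 0.667, 0.667)
```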
Affiliation(s)
- Yiming Li
- McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, United States
- Xueqing Peng
- Department of Biomedical Informatics and Data Science, School of Medicine, Yale University, New Haven, CT 06510, United States
- Jianfu Li
- Department of Artificial Intelligence and Informatics, Mayo Clinic, Jacksonville, FL 32224, United States
- Xu Zuo
- McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, United States
- Suyuan Peng
- Institute of Information on Traditional Chinese Medicine, China Academy of Chinese Medical Sciences, Beijing 100010, China
- Donghong Pei
- The University of Texas MD Anderson Cancer Center, Houston, TX 77030, United States
- Cui Tao
- Department of Artificial Intelligence and Informatics, Mayo Clinic, Jacksonville, FL 32224, United States
- Hua Xu
- Department of Biomedical Informatics and Data Science, School of Medicine, Yale University, New Haven, CT 06510, United States
- Na Hong
- Department of Biomedical Informatics and Data Science, School of Medicine, Yale University, New Haven, CT 06510, United States
11. Yoon SH, Oh SK, Lim BG, Lee HJ. Performance of ChatGPT in the In-Training Examination for Anesthesiology and Pain Medicine Residents in South Korea: Observational Study. JMIR Med Educ 2024;10:e56859. PMID: 39284182; PMCID: PMC11443200. DOI: 10.2196/56859.
Abstract
BACKGROUND ChatGPT has been tested in health care, including the US Medical Licensing Examination and specialty exams, showing near-passing results. Its performance in the field of anesthesiology has been assessed using English board examination questions; however, its effectiveness in Korea remains unexplored. OBJECTIVE This study investigated the problem-solving performance of ChatGPT in the fields of anesthesiology and pain medicine in the Korean language context, highlighted advancements in artificial intelligence (AI), and explored its potential applications in medical education. METHODS We investigated the performance (number of correct answers/number of questions) of GPT-4, GPT-3.5, and CLOVA X in the fields of anesthesiology and pain medicine, using in-training examinations that have been administered to Korean anesthesiology residents over the past 5 years, with an annual composition of 100 questions. Questions containing images, diagrams, or photographs were excluded from the analysis. Furthermore, to assess performance differences across languages, we conducted a comparative analysis of GPT-4's problem-solving proficiency using both the original Korean texts and their English translations. RESULTS A total of 398 questions were analyzed. GPT-4 (67.8%) demonstrated a significantly better overall performance than GPT-3.5 (37.2%) and CLOVA X (36.7%). However, GPT-3.5 and CLOVA X did not show significant differences in their overall performance. Additionally, GPT-4 showed superior performance on questions translated into English, indicating a language processing discrepancy (English: 75.4% vs Korean: 67.8%; difference 7.5%; 95% CI 3.1%-11.9%; P=.001). CONCLUSIONS This study underscores the potential of AI tools, such as ChatGPT, in medical education and practice but emphasizes the need for cautious application and further refinement, especially in non-English medical contexts. The findings suggest that although AI advancements are promising, they require careful evaluation and development to ensure acceptable performance across diverse linguistic and professional settings.
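Because each question was answered in both Korean and English, the language comparison is a paired-proportions problem; a sketch using McNemar's test on hypothetical paired counts, chosen so the marginals reproduce the reported 67.8% vs 75.4% accuracies:

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# Hypothetical paired outcomes for 398 questions (rows: Korean; cols: English)
table = np.array([[255, 15],   # Korean correct:   English correct / incorrect
                  [45,  83]])  # Korean incorrect: English correct / incorrect
res = mcnemar(table, exact=False, correction=True)
print(f"chi2 = {res.statistic:.2f}, p = {res.pvalue:.4f}")
print(f"Korean {table[0].sum()/table.sum():.1%} vs English {table[:,0].sum()/table.sum():.1%}")
```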
Affiliation(s)
- Soo-Hyuk Yoon
- Department of Anesthesiology and Pain Medicine, Seoul National University Hospital, Seoul National University College of Medicine, Seoul, Republic of Korea
- Seok Kyeong Oh
- Department of Anesthesiology and Pain Medicine, Korea University Guro Hospital, Korea University College of Medicine, Seoul, Republic of Korea
- Byung Gun Lim
- Department of Anesthesiology and Pain Medicine, Korea University Guro Hospital, Korea University College of Medicine, Seoul, Republic of Korea
- Ho-Jin Lee
- Department of Anesthesiology and Pain Medicine, Seoul National University Hospital, Seoul National University College of Medicine, Seoul, Republic of Korea
12. Akyon SH, Akyon FC, Camyar AS, Hızlı F, Sari T, Hızlı Ş. Evaluating the Capabilities of Generative AI Tools in Understanding Medical Papers: Qualitative Study. JMIR Med Inform 2024;12:e59258. PMID: 39230947; PMCID: PMC11411230. DOI: 10.2196/59258.
Abstract
BACKGROUND Reading medical papers is a challenging and time-consuming task for doctors, especially when the papers are long and complex. A tool that can help doctors efficiently process and understand medical papers is needed. OBJECTIVE This study aims to critically assess and compare the comprehension capabilities of large language models (LLMs) in accurately and efficiently understanding medical research papers using the STROBE (Strengthening the Reporting of Observational Studies in Epidemiology) checklist, which provides a standardized framework for evaluating key elements of observational studies. METHODS This is a methodological study evaluating the understanding capabilities of new generative artificial intelligence tools on medical papers. A novel benchmark pipeline processed 50 medical research papers from PubMed, comparing the answers of 6 LLMs (GPT-3.5-Turbo, GPT-4-0613, GPT-4-1106, PaLM 2, Claude v1, and Gemini Pro) to the benchmark established by expert medical professors. Fifteen questions, derived from the STROBE checklist, assessed LLMs' understanding of different sections of a research paper. RESULTS LLMs exhibited varying performance, with GPT-3.5-Turbo achieving the highest percentage of correct answers (n=3916, 66.9%), followed by GPT-4-1106 (n=3837, 65.6%), PaLM 2 (n=3632, 62.1%), Claude v1 (n=2887, 58.3%), Gemini Pro (n=2878, 49.2%), and GPT-4-0613 (n=2580, 44.1%). Statistical analysis revealed statistically significant differences between LLMs (P<.001), with older models showing inconsistent performance compared to newer versions. LLMs showcased distinct performances for each question across different parts of a scholarly paper, with certain models like PaLM 2 and GPT-3.5 showing remarkable versatility and depth in understanding. CONCLUSIONS This study is the first to evaluate the performance of different LLMs in understanding medical papers using the retrieval augmented generation method. The findings highlight the potential of LLMs to enhance medical research by improving efficiency and facilitating evidence-based decision-making. Further research is needed to address limitations such as the influence of question formats, potential biases, and the rapid evolution of LLM models.
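A chi-square test of independence is one way to compare correct-answer counts across models, consistent with the P<.001 reported above; a sketch using the reported counts (the per-model denominator is assumed, since the abstract does not state it):

```python
import numpy as np
from scipy.stats import chi2_contingency

models = ["GPT-3.5-Turbo", "GPT-4-1106", "PaLM 2", "Claude v1", "Gemini Pro", "GPT-4-0613"]
correct = np.array([3916, 3837, 3632, 2887, 2878, 2580])  # counts from the abstract
total = 5850            # assumed common denominator (not stated in the abstract)
table = np.column_stack([correct, total - correct])

chi2, p, dof, _ = chi2_contingency(table)
print(f"chi2({dof}) = {chi2:.1f}, p = {p:.3g}")
```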
Affiliation(s)
- Fatih Cagatay Akyon
- SafeVideo AI, San Francisco, CA, United States
- Graduate School of Informatics, Middle East Technical University, Ankara, Turkey
- Ahmet Sefa Camyar
- Department of Internal Medicine, Ankara Etlik City Hospital, Ankara, Turkey
- Fatih Hızlı
- Faculty of Medicine, Ankara Yildirim Beyazit University, Ankara, Turkey
- Talha Sari
- SafeVideo AI, San Francisco, CA, United States
- Department of Computer Science, Istanbul Technical University, Istanbul, Turkey
- Şamil Hızlı
- Department of Pediatric Gastroenterology, Children Hospital, Ankara Bilkent City Hospital, Ankara Yildirim Beyazit University, Ankara, Turkey
13. Langston E, Charness N, Boot W. Are Virtual Assistants Trustworthy for Medicare Information: An Examination of Accuracy and Reliability. Gerontologist 2024;64:gnae062. PMID: 38832398; PMCID: PMC11258897. DOI: 10.1093/geront/gnae062.
Abstract
BACKGROUND AND OBJECTIVES Advances in artificial intelligence (AI)-based virtual assistants provide a potential opportunity for older adults to use this technology in the context of health information-seeking. Meta-analysis on trust in AI shows that users are influenced by the accuracy and reliability of the AI trustee. We evaluated these dimensions for responses to Medicare queries. RESEARCH DESIGN AND METHODS During the summer of 2023, we assessed the accuracy and reliability of Alexa, Google Assistant, Bard, and ChatGPT-4 on Medicare terminology and general content from a large, standardized question set. We compared the accuracy of these AI systems to that of a large representative sample of Medicare beneficiaries who were queried twenty years prior. RESULTS Alexa and Google Assistant were found to be highly inaccurate when compared to beneficiaries' mean accuracy of 68.4% on terminology queries and 53.0% on general Medicare content. Bard and ChatGPT-4 answered Medicare terminology queries perfectly and performed much better on general Medicare content queries (Bard = 96.3%, ChatGPT-4 = 92.6%) than the average Medicare beneficiary. About one month to a month-and-a-half later, we found that Bard and Alexa's accuracy stayed the same, whereas ChatGPT-4's performance nominally decreased, and Google Assistant's performance nominally increased. DISCUSSION AND IMPLICATIONS LLM-based assistants generate trustworthy information in response to carefully phrased queries about Medicare, in contrast to Alexa and Google Assistant. Further studies will be needed to determine what factors beyond accuracy and reliability influence the adoption and use of such technology for Medicare decision-making.
Affiliation(s)
- Emily Langston
- Department of Psychology, Florida State University, Tallahassee, Florida, USA
- Neil Charness
- Department of Psychology, Florida State University, Tallahassee, Florida, USA
- Walter Boot
- Department of Psychology, Florida State University, Tallahassee, Florida, USA
14. Almulla MA. Investigating influencing factors of learning satisfaction in AI ChatGPT for research: University students perspective. Heliyon 2024;10:e32220. PMID: 38933954; PMCID: PMC11200296. DOI: 10.1016/j.heliyon.2024.e32220.
Abstract
This study investigates the determinants of ChatGPT adoption among university students and its impact on learning satisfaction. Utilizing the Technology Acceptance Model (TAM) and incorporating insights from interaction learning, collaborative learning, and information quality, a structural equation modeling approach was employed. This research collected responses from 262 students at King Faisal University in Saudi Arabia through the use of self-report questionnaires. The data's reliability and validity were assessed using confirmatory factor analysis, followed by path analysis to explore the hypotheses in the proposed model. The results indicate the pivotal roles of interaction learning and collaborative learning in fostering ChatGPT adoption. Social interaction played a significant role, as researchers engaging in conversations and knowledge-sharing expressed increased comfort with ChatGPT. Information quality was found to substantially influence researchers' decisions to continue using ChatGPT, emphasizing the need for ongoing improvement in the accuracy and relevance of content provided. Perceived ease of use and perceived usefulness played intermediary roles in linking ChatGPT engagement to learning satisfaction. User-friendly interfaces and perceived utility were identified as crucial factors affecting overall satisfaction levels. Notably, ChatGPT positively impacted learning motivation, indicating its potential to enhance student engagement and interest in learning. The study's findings have implications for educational practitioners seeking to improve the implementation of AI technologies among university students, emphasizing user-friendly design, collaborative learning, and factors influencing satisfaction. The study concludes with insights into the complex interplay between AI-powered tools, learning objectives, and motivation, highlighting the need for continued research to comprehensively understand these dynamics.
Affiliation(s)
- Mohammed Abdullatif Almulla
- Department of Curriculum and Instruction, Faculty of Education, King Faisal University, Al Ahsa, 31982, Saudi Arabia
15. Shin E, Yu Y, Bies RR, Ramanathan M. Evaluation of ChatGPT and Gemini large language models for pharmacometrics with NONMEM. J Pharmacokinet Pharmacodyn 2024;51:187-197. PMID: 38656706. DOI: 10.1007/s10928-024-09921-y.
Abstract
We assessed the ChatGPT 4.0 (ChatGPT) and Gemini Ultra 1.0 (Gemini) large language models on NONMEM coding tasks relevant to pharmacometrics and clinical pharmacology. ChatGPT and Gemini were assessed on tasks mimicking real-world applications of NONMEM. The tasks ranged from providing a curriculum for learning NONMEM and an overview of NONMEM code structure to generating code. Prompts in lay language to elicit NONMEM code for a linear pharmacokinetic (PK) model with oral administration and a more complex model with two parallel first-order absorption mechanisms were investigated. Reproducibility and the impact of "temperature" hyperparameter settings were assessed. The code was reviewed by two NONMEM experts. ChatGPT and Gemini provided NONMEM curriculum structures combining foundational knowledge with advanced concepts (e.g., covariate modeling and Bayesian approaches) and practical skills, including NONMEM code structure and syntax. ChatGPT provided an informative summary of the NONMEM control stream structure and outlined the key NONMEM Translator (NM-TRAN) records needed. ChatGPT and Gemini were able to generate code blocks for the NONMEM control stream from the lay language prompts for the two coding tasks. The control streams contained focal structural and syntax errors that required revision before they could be executed without errors and warnings. The code output from ChatGPT and Gemini was not reproducible, and varying the temperature hyperparameter did not reduce the errors and omissions substantively. Large language models may be useful in pharmacometrics for efficiently generating an initial coding template for modeling projects. However, the output can contain errors and omissions that require correction.
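For orientation, the NM-TRAN records the abstract refers to can be seen in a minimal one-compartment oral-absorption control stream; the sketch below assembles one as a Python string (the data file name, column labels, and initial estimates are placeholders, and this is our illustration rather than code from the study):

```python
# Record names follow standard NM-TRAN conventions (ADVAN2/TRANS2 = one-compartment
# model with first-order absorption); all values below are placeholders.
control_stream = """$PROBLEM One-compartment PK model with first-order oral absorption
$DATA data.csv IGNORE=@
$INPUT ID TIME AMT DV MDV
$SUBROUTINES ADVAN2 TRANS2
$PK
  CL = THETA(1) * EXP(ETA(1))   ; clearance
  V  = THETA(2) * EXP(ETA(2))   ; central volume
  KA = THETA(3) * EXP(ETA(3))   ; absorption rate constant
  S2 = V                        ; scale for the central compartment
$ERROR
  Y = F * (1 + EPS(1))          ; proportional residual error
$THETA (0, 5) (0, 50) (0, 1)    ; initial estimates: CL, V, KA
$OMEGA 0.1 0.1 0.1
$SIGMA 0.04
$ESTIMATION METHOD=1 INTERACTION MAXEVAL=9999
"""
print(control_stream)
```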
Affiliation(s)
- Euibeom Shin
- Department of Pharmaceutical Sciences, University at Buffalo, The State University of New York, Buffalo, NY, 14214-8033, USA
- Yifan Yu
- Department of Pharmaceutical Sciences, University at Buffalo, The State University of New York, Buffalo, NY, 14214-8033, USA
- Robert R Bies
- Department of Pharmaceutical Sciences, University at Buffalo, The State University of New York, Buffalo, NY, 14214-8033, USA
- Murali Ramanathan
- Department of Pharmaceutical Sciences, University at Buffalo, The State University of New York, Buffalo, NY, 14214-8033, USA
16. Jo H, Park DH. Effects of ChatGPT's AI capabilities and human-like traits on spreading information in work environments. Sci Rep 2024;14:7806. PMID: 38565880; PMCID: PMC10987623. DOI: 10.1038/s41598-024-57977-0.
Abstract
The rapid proliferation and integration of AI chatbots in office environments, specifically the advanced AI model ChatGPT, prompts an examination of how its features and updates impact knowledge processes, satisfaction, and word-of-mouth (WOM) among office workers. This study investigates the determinants of WOM among office workers who are users of ChatGPT. We adopted a quantitative approach, utilizing a stratified random sampling technique to collect data from a diverse group of office workers experienced in using ChatGPT. The hypotheses were rigorously tested through Structural Equation Modeling (SEM) using the SmartPLS 4. The results revealed that system updates, memorability, and non-language barrier attributes of ChatGPT significantly enhanced knowledge acquisition and application. Additionally, the human-like personality traits of ChatGPT significantly increased both utilitarian value and satisfaction. Furthermore, the study showed that knowledge acquisition and application led to a significant increase in utilitarian value and satisfaction, which subsequently increased WOM. Age had a positive influence on WOM, while gender had no significant impact. The findings provide theoretical contributions by expanding our understanding of AI chatbots' role in knowledge processes, satisfaction, and WOM, particularly among office workers.
Affiliation(s)
- Hyeon Jo
- Headquarters, HJ Institute of Technology and Management, 71 Jungdong-ro 39, Bucheon-si, Gyeonggi-do, 14721, Republic of Korea
- Do-Hyung Park
- Graduate School of Business IT, Kookmin University, 77, Jeongneung-ro, Seongbuk-gu, Seoul, 02707, Republic of Korea
17. Valentini M, Szkandera J, Smolle MA, Scheipl S, Leithner A, Andreou D. Artificial intelligence large language model ChatGPT: is it a trustworthy and reliable source of information for sarcoma patients? Front Public Health 2024;12:1303319. PMID: 38584922; PMCID: PMC10995284. DOI: 10.3389/fpubh.2024.1303319.
Abstract
Introduction Since its introduction in November 2022, the artificial intelligence large language model ChatGPT has taken the world by storm. Among other applications, it can be used by patients as a source of information on diseases and their treatments. However, little is known about the quality of the sarcoma-related information ChatGPT provides. We therefore aimed at analyzing how sarcoma experts evaluate the quality of ChatGPT's responses on sarcoma-related inquiries and assess the bot's answers in specific evaluation metrics. Methods The ChatGPT responses to a sample of 25 sarcoma-related questions (5 definitions, 9 general questions, and 11 treatment-related inquiries) were evaluated by 3 independent sarcoma experts. Each response was compared with authoritative resources and international guidelines and graded on 5 different metrics using a 5-point Likert scale: completeness, misleadingness, accuracy, being up-to-date, and appropriateness. This resulted in a maximum of 25 and a minimum of 5 points per answer, with higher scores indicating a higher response quality. Scores ≥21 points were rated as very good, between 16 and 20 as good, while scores ≤15 points were classified as poor (11-15) and very poor (≤10). Results The median score that ChatGPT's answers achieved was 18.3 points (interquartile range [IQR], 12.3-20.3 points). Six answers were classified as very good, 9 as good, while 5 answers each were rated as poor and very poor. The best scores were documented in the evaluation of how appropriate the response was for patients (median, 3.7 points; IQR, 2.5-4.2 points), which were significantly higher compared to the accuracy scores (median, 3.3 points; IQR, 2.0-4.2 points; p = 0.035). ChatGPT fared considerably worse with treatment-related questions, with only 45% of its responses classified as good or very good, compared to general questions (78% of responses good/very good) and definitions (60% of responses good/very good). Discussion The answers ChatGPT provided on a rare disease, such as sarcoma, were found to be of very inconsistent quality, with some answers being classified as very good and others as very poor. Sarcoma physicians should be aware of the risks of misinformation that ChatGPT poses and advise their patients accordingly.
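The rubric's score bands translate directly into code; a small sketch (the per-metric grades are hypothetical):

```python
def classify(total):
    """Map a 5-25 rubric total to the quality bands used by the evaluators."""
    if total >= 21: return "very good"
    if total >= 16: return "good"
    if total >= 11: return "poor"
    return "very poor"

# Hypothetical 1-5 Likert grades from one expert for one answer, one per metric
grades = {"completeness": 4, "misleadingness": 5, "accuracy": 3,
          "up_to_date": 4, "appropriateness": 4}
total = sum(grades.values())
print(total, classify(total))  # 20 good
```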
Affiliation(s)
- Marisa Valentini
- Department of Orthopaedics and Trauma, Medical University of Graz, Graz, Austria
- Joanna Szkandera
- Division of Oncology, Department of Internal Medicine, Medical University of Graz, Graz, Austria
- Maria Anna Smolle
- Department of Orthopaedics and Trauma, Medical University of Graz, Graz, Austria
- Susanne Scheipl
- Department of Orthopaedics and Trauma, Medical University of Graz, Graz, Austria
- Andreas Leithner
- Department of Orthopaedics and Trauma, Medical University of Graz, Graz, Austria
- Dimosthenis Andreou
- Department of Orthopaedics and Trauma, Medical University of Graz, Graz, Austria
18. Lee KH, Lee RW. ChatGPT's Accuracy on Magnetic Resonance Imaging Basics: Characteristics and Limitations Depending on the Question Type. Diagnostics (Basel) 2024;14:171. PMID: 38248048; PMCID: PMC10814518. DOI: 10.3390/diagnostics14020171.
Abstract
Our study aimed to assess the accuracy and limitations of ChatGPT in the domain of MRI, focusing on its performance in answering simple knowledge questions and specialized multiple-choice questions. A two-step approach was used to evaluate ChatGPT. In the first step, 50 simple MRI-related questions were asked, and ChatGPT's answers were categorized as correct, partially correct, or incorrect by independent researchers. In the second step, 75 multiple-choice questions covering various MRI topics were posed, and the answers were similarly categorized. The study utilized Cohen's kappa coefficient for assessing interobserver agreement. ChatGPT demonstrated high accuracy in answering straightforward MRI questions, with over 85% classified as correct. However, its performance varied significantly across multiple-choice questions, with accuracy rates ranging from 40% to 66.7%, depending on the topic. This indicated a notable gap in its ability to handle more complex, specialized questions requiring deeper understanding and context. In conclusion, this study critically evaluates the accuracy of ChatGPT in addressing questions related to magnetic resonance imaging (MRI), highlighting its potential and limitations in the healthcare sector, particularly in radiology. Our findings demonstrate that ChatGPT, while proficient in responding to straightforward MRI-related questions, exhibits variability in its ability to accurately answer complex multiple-choice questions that require more profound, specialized knowledge of MRI. This discrepancy underscores the nuanced role AI can play in medical education and healthcare decision-making, necessitating a balanced approach to its application.
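Interobserver agreement of the kind reported here is typically computed with Cohen's kappa; a minimal sketch with hypothetical category labels:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical labels from two researchers: 2 = correct, 1 = partially correct, 0 = incorrect
rater_1 = [2, 2, 1, 2, 0, 2, 1, 2, 2, 1]
rater_2 = [2, 2, 1, 2, 0, 2, 2, 2, 2, 1]
print(f"kappa = {cohen_kappa_score(rater_1, rater_2):.2f}")
```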
Affiliation(s)
- Ro-Woon Lee
- Department of Radiology, Inha University College of Medicine, Incheon 22212, Republic of Korea
19. Gravina AG, Pellegrino R, Cipullo M, Palladino G, Imperio G, Ventura A, Auletta S, Ciamarra P, Federico A. May ChatGPT be a tool producing medical information for common inflammatory bowel disease patients' questions? An evidence-controlled analysis. World J Gastroenterol 2024;30:17-33. PMID: 38293321; PMCID: PMC10823903. DOI: 10.3748/wjg.v30.i1.17.
Abstract
Artificial intelligence is increasingly entering everyday healthcare. Large language model (LLM) systems such as Chat Generative Pre-trained Transformer (ChatGPT) have become potentially accessible to everyone, including patients with inflammatory bowel diseases (IBD). However, significant ethical issues and pitfalls exist in innovative LLM tools. The hype generated by such systems may lead to unweighted patient trust in these systems. Therefore, it is necessary to understand whether LLMs (trendy ones, such as ChatGPT) can produce plausible medical information (MI) for patients. This review examined ChatGPT's potential to provide MI regarding questions commonly addressed by patients with IBD to their gastroenterologists. Our review of the outputs provided by ChatGPT showed that the tool has some attractive potential, but also significant limitations in updating and detailing information, and in some cases it provided inaccurate information. Further studies and refinement of ChatGPT, possibly aligning the outputs with the leading medical evidence provided by reliable databases, are needed.
Collapse
Affiliation(s)
- Antonietta Gerarda Gravina
- Division of Hepatogastroenterology, Department of Precision Medicine, University of Campania Luigi Vanvitelli, Naples 80138, Italy
| | - Raffaele Pellegrino
- Division of Hepatogastroenterology, Department of Precision Medicine, University of Campania Luigi Vanvitelli, Naples 80138, Italy
| | - Marina Cipullo
- Division of Hepatogastroenterology, Department of Precision Medicine, University of Campania Luigi Vanvitelli, Naples 80138, Italy
| | - Giovanna Palladino
- Division of Hepatogastroenterology, Department of Precision Medicine, University of Campania Luigi Vanvitelli, Naples 80138, Italy
| | - Giuseppe Imperio
- Division of Hepatogastroenterology, Department of Precision Medicine, University of Campania Luigi Vanvitelli, Naples 80138, Italy
| | - Andrea Ventura
- Division of Hepatogastroenterology, Department of Precision Medicine, University of Campania Luigi Vanvitelli, Naples 80138, Italy
| | - Salvatore Auletta
- Division of Hepatogastroenterology, Department of Precision Medicine, University of Campania Luigi Vanvitelli, Naples 80138, Italy
| | - Paola Ciamarra
- Division of Hepatogastroenterology, Department of Precision Medicine, University of Campania Luigi Vanvitelli, Naples 80138, Italy
| | - Alessandro Federico
- Division of Hepatogastroenterology, Department of Precision Medicine, University of Campania Luigi Vanvitelli, Naples 80138, Italy
| |
Collapse
|
20
|
Creswell J, Vo LNQ, Qin ZZ, Muyoyeta M, Tovar M, Wong EB, Ahmed S, Vijayan S, John S, Maniar R, Rahman T, MacPherson P, Banu S, Codlin AJ. Early user perspectives on using computer-aided detection software for interpreting chest X-ray images to enhance access and quality of care for persons with tuberculosis. BMC Glob Public Health 2023; 1:30. [PMID: 39681961 DOI: 10.1186/s44263-023-00033-2] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 08/11/2023] [Accepted: 12/06/2023] [Indexed: 12/18/2024]
Abstract
Despite 30 years as a public health emergency, tuberculosis (TB) remains one of the world's deadliest diseases. Most deaths occur among persons with TB who are never reached with diagnosis and treatment. Timely screening and accurate detection of TB, particularly with sensitive tools such as chest radiography, is therefore crucial for reducing the global burden of the disease. However, a lack of qualified human resources is a common limiting factor in many high TB-burden countries. Artificial intelligence (AI) has emerged as a powerful complement in many facets of life, including the interpretation of chest X-ray images. Yet while AI may serve as a viable alternative to human radiographers and radiologists, those suffering from TB are unlikely to reap the benefits of this technological advance without clinically effective and cost-conscious deployment. The World Health Organization recommended the use of AI for TB screening in 2021, and early adopters have since applied the technology in many ways. In this manuscript, we present a compilation of early user experiences from nine high TB-burden countries, focused on practical considerations and best practices related to deployment, threshold and use case selection, and scale-up. While we offer technical and operational guidance on the use of AI for interpreting chest X-ray images for TB detection, our aim remains to maximize the benefit that programs, implementers, and ultimately TB-affected individuals can derive from this innovative technology.
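Threshold selection, one of the practical considerations the authors discuss, amounts to picking an operating point on the CAD software's abnormality-score scale. A minimal sketch of one common approach, choosing the highest threshold that still meets a target sensitivity on a locally labeled calibration set, is shown below; the scores, labels, and target value are made up for illustration.

```python
def pick_threshold(scores, labels, target_sensitivity=0.90):
    """Highest abnormality-score threshold whose sensitivity on a local
    calibration set still meets the target (higher thresholds mean fewer
    confirmatory tests but more missed TB cases)."""
    positives = sorted((s for s, y in zip(scores, labels) if y == 1),
                       reverse=True)
    n_pos = len(positives)
    # Walk thresholds down from the top score; sensitivity at threshold t
    # is the fraction of positives scoring >= t.
    for i, t in enumerate(positives, start=1):
        if i / n_pos >= target_sensitivity:
            return t
    return min(positives)  # fall back to catching every positive

# Hypothetical CAD scores (0-1) and bacteriological confirmation (1 = TB).
scores = [0.95, 0.88, 0.81, 0.74, 0.66, 0.52, 0.41, 0.33, 0.20, 0.10]
labels = [1,    1,    0,    1,    1,    0,    0,    1,    0,    0]
t = pick_threshold(scores, labels, target_sensitivity=0.80)
print(f"refer patients with score >= {t:.2f} for confirmatory testing")
```

Because score distributions differ across populations and X-ray hardware, such calibration is typically repeated per site rather than reusing a vendor default.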
Collapse
Affiliation(s)
| | - Luan Nguyen Quang Vo
- Friends for International TB Relief (FIT), Hanoi, Vietnam
- Department of Global Health, WHO Collaboration Centre On Tuberculosis and Social Medicine, Karolinska Institutet, Stockholm, Sweden
| | | | - Monde Muyoyeta
- Centre for Infectious Disease Research in Zambia, Lusaka, Zambia
| | | | - Emily Beth Wong
- Africa Health Research Institute, KwaZulu-Natal, South Africa
- Division of Infectious Diseases, Heersink School of Medicine, University of Alabama Birmingham, Birmingham, AL, USA
| | - Shahriar Ahmed
- International Centre for Diarrhoeal Disease Research, Bangladesh (icddr,b), Dhaka, Bangladesh
| | | | | | - Rabia Maniar
- Interactive Research and Development (IRD) Pakistan, Karachi, Pakistan
| | | | - Peter MacPherson
- School of Health & Wellbeing, University of Glasgow, Glasgow, UK
- Malawi-Liverpool-Wellcome Trust Clinical Research Programme, Blantyre, Malawi
- London School of Hygiene & Tropical Medicine, London, UK
| | - Sayera Banu
- International Centre for Diarrhoeal Disease Research, Bangladesh (icddr,b), Dhaka, Bangladesh
| | - Andrew James Codlin
- Friends for International TB Relief (FIT), Hanoi, Vietnam
- Department of Global Health, WHO Collaboration Centre On Tuberculosis and Social Medicine, Karolinska Institutet, Stockholm, Sweden
| |
Collapse
|
21
|
Sartori G, Orrù G. Language models and psychological sciences. Front Psychol 2023; 14:1279317. [PMID: 37941751 PMCID: PMC10629494 DOI: 10.3389/fpsyg.2023.1279317] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/30/2023] [Accepted: 09/26/2023] [Indexed: 11/10/2023] Open
Abstract
Large language models (LLMs) are demonstrating impressive performance on many reasoning and problem-solving tasks from cognitive psychology. When tested, their accuracy is often on par with that of average neurotypical adults, challenging long-standing critiques of associative models. Here we analyse recent findings at the intersection of LLMs and cognitive science, discussing how modern LLMs resurrect associationist principles, with abilities such as long-distance association enabling complex reasoning. While limitations remain in areas such as causal cognition and planning, phenomena like emergence suggest room for growth. Providing in-context examples and scaling up the network are methods that further improve LLM abilities, mirroring facilitation effects in human cognition. Analysis of LLM errors also provides insight into human cognitive biases. Overall, we argue that LLMs represent a promising development for cognitive modelling, enabling new explorations of the mechanisms underlying intelligence and reasoning from an associationist point of view. Carefully evaluating LLMs with the tools of cognitive psychology will further our understanding of the building blocks of the human mind.
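The facilitation effect from in-context examples mentioned above can be made concrete with a few-shot prompt: worked examples are prepended so the model infers the task from context. A minimal sketch follows; the syllogism items and prompt format are illustrative, not taken from the paper.

```python
# Few-shot prompting sketch: prepend worked examples so the model can
# pick up the task from context, mirroring facilitation in human cognition.
# The reasoning items below are illustrative placeholders.
FEW_SHOT_EXAMPLES = [
    ("All birds have wings. Robins are birds. Do robins have wings?", "Yes"),
    ("No fish are mammals. Whales are mammals. Are whales fish?", "No"),
]

def build_few_shot_prompt(question):
    parts = ["Answer each reasoning question with Yes or No.\n"]
    for q, a in FEW_SHOT_EXAMPLES:
        parts.append(f"Q: {q}\nA: {a}\n")
    parts.append(f"Q: {question}\nA:")
    return "\n".join(parts)

print(build_few_shot_prompt(
    "All squares are rectangles. This shape is a square. Is it a rectangle?"))
```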
Collapse
Affiliation(s)
- Giuseppe Sartori
- Department of General Psychology, University of Padova, Padova, Italy
| | - Graziella Orrù
- Department of Surgical, Medical, Molecular and Critical Area Pathology, University of Pisa, Pisa, Italy
| |
Collapse
|