1
Bednarczyk L, Reichenpfader D, Gaudet-Blavignac C, Ette AK, Zaghir J, Zheng Y, Bensahla A, Bjelogrlic M, Lovis C. Scientific Evidence for Clinical Text Summarization Using Large Language Models: Scoping Review. J Med Internet Res 2025;27:e68998. [PMID: 40371947] [PMCID: PMC12123242] [DOI: 10.2196/68998]
Abstract
BACKGROUND Information overload in electronic health records requires effective solutions to alleviate clinicians' administrative tasks. Automatically summarizing clinical text has gained significant attention with the rise of large language models. While individual studies show optimism, a structured overview of the research landscape is lacking. OBJECTIVE This study aims to present the current state of the art on clinical text summarization using large language models, evaluate the level of evidence in existing research, and assess the applicability of performance findings in clinical settings. METHODS This scoping review complied with the PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews) guidelines. Literature published between January 1, 2019, and June 18, 2024, was identified from 5 databases: PubMed, Embase, Web of Science, IEEE Xplore, and ACM Digital Library. Studies were excluded if they did not describe transformer-based models, did not focus on clinical text summarization, did not engage with free-text data, were not original research, were nonretrievable, were not peer-reviewed, or were not in English, French, Spanish, or German. Data related to study context and characteristics, scope of research, and evaluation methodologies were systematically collected and analyzed by 3 authors independently. RESULTS A total of 30 original studies were included in the analysis. All used observational retrospective designs, mainly using real patient data (n=28, 93%). The research landscape demonstrated a narrow research focus, often centered on summarizing radiology reports (n=17, 57%), primarily involving data from the intensive care unit (n=15, 50%) of US-based institutions (n=19, 73%), in English (n=26, 87%). This focus aligned with the frequent reliance on the open-source Medical Information Mart for Intensive Care dataset (n=15, 50%). Summarization methodologies predominantly involved abstractive approaches (n=17, 57%) on single-document inputs (n=4, 13%) with unstructured data (n=13, 43%), yet reporting on methodological details remained inconsistent across studies. Model selection involved both open-source models (n=26, 87%) and proprietary models (n=7, 23%). Evaluation frameworks were highly heterogeneous. All studies conducted internal validation, but external validation (n=2, 7%), failure analysis (n=6, 20%), and patient safety risk analysis (n=1, 3%) were infrequent, and none reported bias assessment. Most studies used both automated metrics and human evaluation (n=16, 53%), while 10 (33%) used only automated metrics and 4 (13%) used only human evaluation. CONCLUSIONS Key barriers hinder the translation of current research into trustworthy, clinically valid applications. Current research remains exploratory and limited in scope, with many applications yet to be explored. Performance assessments often lack reliability, and clinical impact evaluations are insufficient, raising concerns about model utility, safety, fairness, and data privacy. Advancing the field requires more robust evaluation frameworks, a broader research scope, and a stronger focus on real-world applicability.
Affiliation(s)
- Lydie Bednarczyk
  - Division of Medical Information Sciences, University Hospital of Geneva, Geneva, Switzerland
- Daniel Reichenpfader
  - Institute for Patient-centered Digital Health, Bern University of Applied Sciences, Biel, Switzerland
  - Faculty of Medicine, University of Geneva, Geneva, Switzerland
- Amon Kenna Ette
  - Division of Medical Information Sciences, University Hospital of Geneva, Geneva, Switzerland
  - Faculty of Medicine, University of Geneva, Geneva, Switzerland
- Jamil Zaghir
  - Division of Medical Information Sciences, University Hospital of Geneva, Geneva, Switzerland
  - Faculty of Medicine, University of Geneva, Geneva, Switzerland
- Yuanyuan Zheng
  - Division of Medical Information Sciences, University Hospital of Geneva, Geneva, Switzerland
  - Faculty of Medicine, University of Geneva, Geneva, Switzerland
- Adel Bensahla
  - Division of Medical Information Sciences, University Hospital of Geneva, Geneva, Switzerland
  - Faculty of Medicine, University of Geneva, Geneva, Switzerland
- Mina Bjelogrlic
  - Division of Medical Information Sciences, University Hospital of Geneva, Geneva, Switzerland
  - Faculty of Medicine, University of Geneva, Geneva, Switzerland
- Christian Lovis
  - Division of Medical Information Sciences, University Hospital of Geneva, Geneva, Switzerland
  - Faculty of Medicine, University of Geneva, Geneva, Switzerland
2
Luo X, Tham YC, Giuffrè M, Ranisch R, Daher M, Lam K, Eriksen AV, Hsu CW, Ozaki A, Moraes FYD, Khanna S, Su KP, Begagić E, Bian Z, Chen Y, Estill J. Reporting guideline for the use of Generative Artificial intelligence tools in MEdical Research: the GAMER Statement. BMJ Evid Based Med 2025:bmjebm-2025-113825. [PMID: 40360239] [DOI: 10.1136/bmjebm-2025-113825]
Abstract
OBJECTIVES Generative artificial intelligence (GAI) tools can enhance the quality and efficiency of medical research, but their improper use may result in plagiarism, academic fraud and unreliable findings. Transparent reporting of GAI use is essential, yet existing guidelines from journals and institutions are inconsistent, with no standardised principles. DESIGN AND SETTING International online Delphi study. PARTICIPANTS International experts in medicine and artificial intelligence. MAIN OUTCOME MEASURES The primary outcome measure is the consensus level of the Delphi expert panel on the items proposed for inclusion in GAMER (Reporting guideline for the use of Generative Artificial intelligence tools in MEdical Research). RESULTS The development process included a scoping review, two Delphi rounds and virtual meetings. In total, 51 experts from 26 countries participated in the process (44 in the Delphi survey). The final checklist comprises nine reporting items: general declaration, GAI tool specifications, prompting techniques, the tool's role in the study, declaration of new GAI model(s) developed, artificial intelligence-assisted sections in the manuscript, content verification, data privacy and impact on conclusions. CONCLUSION GAMER provides a universal and standardised guideline for GAI use in medical research, ensuring transparency, integrity and quality.
Affiliation(s)
- Xufei Luo
  - Research Unit of Evidence-Based Evaluation and Guidelines, Chinese Academy of Medical Sciences (2021RU017), School of Basic Medical Sciences, Lanzhou University, Lanzhou, Gansu, China
- Yih Chung Tham
  - Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
  - Centre for Innovation and Precision Eye Health, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
  - Singapore Eye Research Institute, Singapore National Eye Centre, Singapore
  - Ophthalmology and Visual Science Academic Clinical Program, Duke-NUS Medical School, Singapore
- Mauro Giuffrè
  - Department of Internal Medicine (Digestive Diseases), Yale School of Medicine, Yale University, New Haven, Connecticut, USA
  - Department of Medical, Surgical, and Health Sciences, University of Trieste, Trieste, Italy
- Robert Ranisch
  - Faculty of Health Sciences Brandenburg, University of Potsdam, Potsdam, Brandenburg, Germany
- Mohammad Daher
  - Orthopedic Department, Hôtel Dieu de France, Beirut, Lebanon
- Kyle Lam
  - Department of Surgery and Cancer, Imperial College London, London, UK
- Che-Wei Hsu
  - Department of Psychological Medicine, Dunedin School of Medicine, University of Otago, Dunedin, New Zealand
  - Bachelor of Social Services, College of Community Development and Personal Wellbeing, Otago Polytechnic, Dunedin, New Zealand
- Akihiko Ozaki
  - Jyoban Hospital of Tokiwa Foundation, Iwaki, Fukushima, Japan
- Sahil Khanna
  - Gastroenterology and Hepatology, Mayo Clinic, Rochester, Minnesota, USA
- Kuan-Pin Su
  - Mind-Body Interface Research Center (MBI-Lab), China Medical University Hospital, Taichung, Taiwan
  - An-Nan Hospital, China Medical University, Tainan, Taiwan
- Emir Begagić
  - Department of Neurosurgery, Cantonal Hospital Zenica, Zenica, Bosnia and Herzegovina
- Zhaoxiang Bian
  - Vincent V.C. Woo Chinese Medicine Clinical Research Institute, School of Chinese Medicine, Hong Kong Baptist University, Hong Kong, China
  - Chinese EQUATOR Centre, Hong Kong, China
- Yaolong Chen
  - Research Unit of Evidence-Based Evaluation and Guidelines, Chinese Academy of Medical Sciences (2021RU017), School of Basic Medical Sciences, Lanzhou University, Lanzhou, Gansu, China
  - Evidence-based Medicine Center, School of Basic Medical Sciences, Lanzhou University, Lanzhou, China
  - WHO Collaborating Centre for Guideline Implementation and Knowledge Translation, Lanzhou, China
- Janne Estill
  - Evidence-based Medicine Center, School of Basic Medical Sciences, Lanzhou University, Lanzhou, China
  - Institute of Global Health, University of Geneva, Geneva, Switzerland
3
Chatziisaak D, Burri P, Sparn M, Hahnloser D, Steffen T, Bischofberger S. Concordance of ChatGPT artificial intelligence decision-making in colorectal cancer multidisciplinary meetings: retrospective study. BJS Open 2025;9:zraf040. [PMID: 40331891] [PMCID: PMC12056934] [DOI: 10.1093/bjsopen/zraf040]
Abstract
BACKGROUND The objective of this study was to evaluate the concordance between therapeutic recommendations proposed by a multidisciplinary team meeting and those generated by a large language model (ChatGPT) for colorectal cancer. Although multidisciplinary teams represent the 'standard' for decision-making in cancer treatment, they require significant resources and may be susceptible to human bias. Artificial intelligence, particularly large language models such as ChatGPT, has the potential to enhance or optimize decision-making processes. The present study examines the potential for integrating artificial intelligence into clinical practice by comparing multidisciplinary team decisions with those generated by ChatGPT. METHODS A retrospective, single-centre study was conducted involving consecutive patients with newly diagnosed colorectal cancer discussed at our multidisciplinary team meeting. The pre- and post-therapeutic multidisciplinary team meeting recommendations were assessed for concordance with those generated by ChatGPT-4. RESULTS One hundred consecutive patients with newly diagnosed colorectal cancer of all stages were included. In the pretherapeutic discussions, complete concordance was observed in 72.5% of decisions, with partial concordance in 10.2% and discordance in 17.3%. For post-therapeutic discussions, concordance increased to 82.8%; 11.8% of decisions displayed partial concordance and 5.4% demonstrated discordance. Discordance was more frequent in patients older than 77 years and in those with an American Society of Anesthesiologists classification ≥ III. CONCLUSION There is substantial concordance between the recommendations generated by ChatGPT and those provided by traditional multidisciplinary team meetings, indicating the potential utility of artificial intelligence in supporting clinical decision-making for colorectal cancer management.
Affiliation(s)
- Dimitrios Chatziisaak
  - Department of Surgery, Kantonsspital St. Gallen, St. Gallen, Switzerland
  - Department of Surgery, Centre Hospitalier Universitaire Vaudois, Lausanne, Switzerland
- Pascal Burri
  - Department of Surgery, Kantonsspital St. Gallen, St. Gallen, Switzerland
- Moritz Sparn
  - Department of Surgery, Kantonsspital St. Gallen, St. Gallen, Switzerland
- Dieter Hahnloser
  - Department of Surgery, Centre Hospitalier Universitaire Vaudois, Lausanne, Switzerland
- Thomas Steffen
  - Department of Surgery, Kantonsspital St. Gallen, St. Gallen, Switzerland
4
Harkos C, Hadjigeorgiou AG, Voutouri C, Kumar AS, Stylianopoulos T, Jain RK. Using mathematical modelling and AI to improve delivery and efficacy of therapies in cancer. Nat Rev Cancer 2025;25:324-340. [PMID: 39972158] [DOI: 10.1038/s41568-025-00796-w]
Abstract
Mathematical modelling has proven to be a valuable tool in predicting the delivery and efficacy of molecular, antibody-based, nano and cellular therapy in solid tumours. Mathematical models based on our understanding of the biological processes at the subcellular, cellular and tissue levels are known as mechanistic models, which, in turn, are divided into continuous and discrete models. Continuous models are further divided into lumped parameter models, used to describe the temporal distribution of medicine in tumours and normal organs, and distributed parameter models, used to study the spatiotemporal distribution of therapy in tumours. Discrete models capture interactions at the cellular and subcellular levels. Collectively, these models are useful for optimizing the delivery and efficacy of molecular, nanoscale and cellular therapy in tumours by incorporating the biological characteristics of tumours, the physicochemical properties of drugs, the interactions among drugs, cancer cells and various components of the tumour microenvironment, and for enabling patient-specific predictions when combined with medical imaging. Artificial intelligence-based methods, such as machine learning, have ushered in a new era in oncology. These data-driven approaches complement mechanistic models and have immense potential for improving cancer detection, treatment and drug discovery. Here we review these diverse approaches and suggest ways to combine mechanistic and artificial intelligence-based models to further improve patient treatment outcomes.
Affiliation(s)
- Constantinos Harkos
  - Cancer Biophysics Laboratory, Department of Mechanical and Manufacturing Engineering, University of Cyprus, Nicosia, Cyprus
- Andreas G Hadjigeorgiou
  - Cancer Biophysics Laboratory, Department of Mechanical and Manufacturing Engineering, University of Cyprus, Nicosia, Cyprus
- Chrysovalantis Voutouri
  - Cancer Biophysics Laboratory, Department of Mechanical and Manufacturing Engineering, University of Cyprus, Nicosia, Cyprus
- Ashwin S Kumar
  - Edwin L. Steele Laboratories, Department of Radiation Oncology, Massachusetts General Hospital and Harvard Medical School, Boston, MA, USA
- Triantafyllos Stylianopoulos
  - Cancer Biophysics Laboratory, Department of Mechanical and Manufacturing Engineering, University of Cyprus, Nicosia, Cyprus
- Rakesh K Jain
  - Edwin L. Steele Laboratories, Department of Radiation Oncology, Massachusetts General Hospital and Harvard Medical School, Boston, MA, USA
5
Demirbaş KC, Saygılı S, Yılmaz EK, Gülmez R, Ağbaş A, Taşdemir M, Canpolat N. The Potential of ChatGPT as a Source of Information for Kidney Transplant Recipients and Their Caregivers. Pediatr Transplant 2025;29:e70068. [PMID: 40078030] [DOI: 10.1111/petr.70068]
Abstract
BACKGROUND Education and enhancing the knowledge of adolescents who will undergo kidney transplantation are among the primary objectives of their care. While there are specific interventions in place to achieve this, they require extensive resources. The rise of large language models like ChatGPT-3.5 offers potential assistance for providing information to patients. This study aimed to evaluate the accuracy, relevance, and safety of ChatGPT-3.5's responses to patient-centered questions about pediatric kidney transplantation. The objective was to assess whether ChatGPT-3.5 could be a supplementary educational tool for adolescents and their caregivers in a complex medical context. METHODS A total of 37 questions about kidney transplantation were presented to ChatGPT-3.5, which was prompted to respond as a health professional would to a layperson. Five pediatric nephrologists independently evaluated the outputs for accuracy, relevance, comprehensiveness, understandability, readability, and safety. RESULTS The mean accuracy, relevancy, and comprehensiveness scores for all outputs were 4.51, 4.56, and 4.55, respectively. Out of 37 outputs, four were rated as completely accurate, and seven were completely relevant and comprehensive. Only one output had an accuracy, relevancy, and comprehensiveness score below 4. Twelve outputs were considered potentially risky, but only three had a risk grade of moderate or higher. Outputs that were considered risky had an accuracy and relevancy below the average. CONCLUSION Our findings suggest that ChatGPT could be a useful tool for adolescents or caregivers of individuals waiting for kidney transplantation. However, the presence of potentially risky outputs underscores the necessity for human oversight and validation.
Affiliation(s)
- Kaan Can Demirbaş
  - Department of Pediatrics, Istanbul University-Cerrahpaşa, Cerrahpaşa School of Medicine, Istanbul, Türkiye
- Seha Saygılı
  - Division of Pediatric Nephrology, Department of Pediatrics, Istanbul University-Cerrahpaşa, Cerrahpaşa School of Medicine, Istanbul, Türkiye
- Esra Karabağ Yılmaz
  - Division of Pediatric Nephrology, Department of Pediatrics, Istanbul University-Cerrahpaşa, Cerrahpaşa School of Medicine, Istanbul, Türkiye
- Rüveyda Gülmez
  - Division of Pediatric Nephrology, Department of Pediatrics, Istanbul Prof. Dr. Suleyman Yalcin Research and Training Hospital, Istanbul, Türkiye
- Ayşe Ağbaş
  - Division of Pediatric Nephrology, Department of Pediatrics, Istanbul University-Cerrahpaşa, Cerrahpaşa School of Medicine, Istanbul, Türkiye
- Mehmet Taşdemir
  - Division of Pediatric Nephrology, Department of Pediatrics, Istinye University School of Medicine, Istanbul, Türkiye
- Nur Canpolat
  - Division of Pediatric Nephrology, Department of Pediatrics, Istanbul University-Cerrahpaşa, Cerrahpaşa School of Medicine, Istanbul, Türkiye
6
Shahnam A, Nindra U, Hitchen N, Tang J, Hong M, Hong JH, Au-Yeung G, Chua W, Ng W, Hopkins AM, Sorich MJ. Application of Generative Artificial Intelligence for Physician and Patient Oncology Letters-AI-OncLetters. JCO Clin Cancer Inform 2025;9:e2400323. [PMID: 40315407] [DOI: 10.1200/cci-24-00323]
Abstract
PURPOSE Although large language models (LLMs) are increasingly used in clinical practice, formal assessments of their quality, accuracy, and effectiveness in medical oncology remain limited. We aimed to evaluate the ability of ChatGPT, an LLM, to generate physician and patient letters from clinical case notes. METHODS Six oncologists created 29 (four training, 25 final) synthetic oncology case notes. Structured prompts for ChatGPT were iteratively developed using the four training cases; once finalized, 25 physician-directed and patient-directed letters were generated. These underwent evaluation by expert consumers and oncologists for accuracy, relevance, and readability using Likert scales. The patient letters were also assessed with the Patient Education Materials Assessment Tool for Print (PEMAT-P), Flesch Reading Ease, and Simple Measure of Gobbledygook index. RESULTS Among physician-to-physician letters, 95% (119/125) of oncologists agreed they were accurate, comprehensive, and relevant, with no safety concerns noted. These letters demonstrated precise documentation of history, investigations, and treatment plans and were logically and concisely structured. Patient-directed letters achieved a mean Flesch Reading Ease score of 73.3 (seventh-grade reading level) and a PEMAT-P score above 80%, indicating high understandability. Consumer reviewers found them clear and appropriate for patient communication. Some omissions of details (eg, side effects), stylistic inconsistencies, and repetitive phrasing were identified, although no clinical safety issues emerged. Seventy-two percent (90/125) of consumers expressed willingness to receive artificial intelligence (AI)-generated patient letters. CONCLUSION ChatGPT, when guided by structured prompts, can generate high-quality letters that align with clinical and patient communication standards. No clinical safety concerns were identified, although addressing occasional omissions and improving natural language flow could enhance their utility in practice. Further studies comparing AI-generated and human-written letters are recommended.
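The readability figures cited above come from standard formulas. As a rough illustration (not the authors' tooling), the sketch below scores a letter with the Flesch Reading Ease and SMOG formulas, assuming a crude vowel-group syllable counter rather than a validated one:

```python
# Illustrative only: naive tokenisation and syllable counting; validated readability
# tools use dictionaries or better heuristics, so scores will differ slightly.
import re

def count_syllables(word: str) -> int:
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability(text: str) -> tuple[float, float]:
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text) or ["x"]
    syllables = sum(count_syllables(w) for w in words)
    polysyllables = sum(1 for w in words if count_syllables(w) >= 3)
    flesch = 206.835 - 1.015 * (len(words) / sentences) - 84.6 * (syllables / len(words))
    smog = 1.0430 * (30 * polysyllables / sentences) ** 0.5 + 3.1291
    return flesch, smog

letter = "Your scan shows the tumour has become smaller. We will keep the same treatment plan."
print(readability(letter))  # higher Flesch = easier; SMOG approximates a US grade level
```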
Affiliation(s)
- Adel Shahnam
  - Department of Medical Oncology, Peter MacCallum Cancer Centre, Melbourne, VIC, Australia
- Udit Nindra
  - Department of Medical Oncology, Liverpool Hospital, Sydney, NSW, Australia
- Nadia Hitchen
  - Department of Medical Oncology, Peter MacCallum Cancer Centre, Melbourne, VIC, Australia
- Joanne Tang
  - Department of Medical Oncology, Liverpool Hospital, Sydney, NSW, Australia
- Martin Hong
  - Department of Medical Oncology, Liverpool Hospital, Sydney, NSW, Australia
- Jun Hee Hong
  - Department of Medical Oncology, Liverpool Hospital, Sydney, NSW, Australia
- George Au-Yeung
  - Department of Medical Oncology, Peter MacCallum Cancer Centre, Melbourne, VIC, Australia
  - Sir Peter MacCallum Department of Oncology, The University of Melbourne, Melbourne, VIC, Australia
- Wei Chua
  - Department of Medical Oncology, Liverpool Hospital, Sydney, NSW, Australia
- Weng Ng
  - Department of Medical Oncology, Liverpool Hospital, Sydney, NSW, Australia
- Ashley M Hopkins
  - Flinders Health and Medical Research Institute, College of Medicine and Public Health, Flinders University, Adelaide, SA, Australia
- Michael J Sorich
  - Flinders Health and Medical Research Institute, College of Medicine and Public Health, Flinders University, Adelaide, SA, Australia
7
Sumner J, Wang Y, Tan SY, Chew EHH, Wenjun Yip A. Perspectives and Experiences With Large Language Models in Health Care: Survey Study. J Med Internet Res 2025;27:e67383. [PMID: 40310666] [PMCID: PMC12082058] [DOI: 10.2196/67383]
Abstract
BACKGROUND Large language models (LLMs) are transforming how data is used, including within the health care sector. However, frameworks including the Unified Theory of Acceptance and Use of Technology highlight the importance of understanding the factors that influence technology use for successful implementation. OBJECTIVE This study aimed to (1) investigate users' uptake, perceptions, and experiences regarding LLMs in health care and (2) contextualize survey responses by demographics and professional profiles. METHODS An electronic survey was administered to elicit stakeholder perspectives of LLMs (health care providers and support functions), their experiences with LLMs, and their potential impact on functional roles. Survey domains included: demographics (6 questions), user experiences of LLMs (8 questions), motivations for using LLMs (6 questions), and perceived impact on functional roles (4 questions). The survey was launched electronically, targeting health care providers or support staff, health care students, and academics in health-related fields. Respondents were adults (>18 years) aware of LLMs. RESULTS Responses were received from 1083 individuals, of which 845 were analyzable. Of the 845 respondents, 221 had yet to use an LLM. Nonusers were more likely to be health care workers (P<.001), older (P<.001), and female (P<.01). Users primarily adopted LLMs for speed, convenience, and productivity. While 75% (470/624) agreed that the user experience was positive, 46% (294/624) found the generated content unhelpful. Regression analysis showed that the experience with LLMs is more likely to be positive if the user is male (odds ratio [OR] 1.62, CI 1.06-2.48), and increasing age was associated with a reduced likelihood of reporting LLM output as useful (OR 0.98, CI 0.96-0.99). Nonusers compared to LLM users were less likely to report LLMs meeting unmet needs (45%, 99/221 vs 65%, 407/624; OR 0.48, CI 0.35-0.65), and males were more likely to report that LLMs do address unmet needs (OR 1.64, CI 1.18-2.28). Furthermore, nonusers compared to LLM users were less likely to agree that LLMs will improve functional roles (63%, 140/221 vs 75%, 469/624; OR 0.60, CI 0.43-0.85). Free-text opinions highlighted concerns regarding autonomy, outperformance, and reduced demand for care. Respondents also predicted changes to human interactions, including fewer but higher quality interactions and a change in consumer needs as LLMs become more common, which would require provider adaptation. CONCLUSIONS Despite the reported benefits of LLMs, nonusers-primarily health care workers, older individuals, and females-appeared more hesitant to adopt these tools. These findings underscore the need for targeted education and support to address adoption barriers and ensure the successful integration of LLMs in health care. Anticipated role changes, evolving human interactions, and the risk of the digital divide further emphasize the need for careful implementation and ongoing evaluation of LLMs in health care to ensure equity and sustainability.
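For readers unfamiliar with the odds ratios quoted above, a crude (unadjusted) calculation from the reported counts illustrates the quantity; the study's published ORs come from regression models, so the adjusted values differ slightly from this sketch:

```python
# Crude odds ratio with a Woolf 95% CI, using counts quoted in the abstract
# (99/221 nonusers vs 407/624 users reporting that LLMs meet unmet needs).
# The published OR of 0.48 is model-based, so this unadjusted figure is close but not identical.
import math

a, b = 99, 221 - 99    # nonusers: needs met / not met
c, d = 407, 624 - 407  # users: needs met / not met

log_or = math.log((a * d) / (b * c))
se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
lo, hi = math.exp(log_or - 1.96 * se), math.exp(log_or + 1.96 * se)
print(f"crude OR = {math.exp(log_or):.2f} (95% CI {lo:.2f} to {hi:.2f})")
```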
Affiliation(s)
- Jennifer Sumner
  - Alexandra Research Centre for Healthcare in a Virtual Environment, Alexandra Hospital, Singapore, Singapore
- Yuchen Wang
  - School of Computing, National University of Singapore, Singapore, Singapore
- Si Ying Tan
  - Alexandra Research Centre for Healthcare in a Virtual Environment, Alexandra Hospital, Singapore, Singapore
- Emily Hwee Hoon Chew
  - Alexandra Research Centre for Healthcare in a Virtual Environment, Alexandra Hospital, Singapore, Singapore
- Alexander Wenjun Yip
  - Alexandra Research Centre for Healthcare in a Virtual Environment, Alexandra Hospital, Singapore, Singapore
8
Weber MT, Noll R, Marchl A, Facchinello C, Grünewaldt A, Hügel C, Musleh K, Wagner TOF, Storf H, Schaaf J. MedBot vs RealDoc: efficacy of large language modeling in physician-patient communication for rare diseases. J Am Med Inform Assoc 2025;32:775-783. [PMID: 39998911] [PMCID: PMC12012358] [DOI: 10.1093/jamia/ocaf034]
Abstract
OBJECTIVES This study assesses the abilities of 2 large language models (LLMs), GPT-4 and BioMistral 7B, in responding to patient queries, particularly concerning rare diseases, and compares their performance with that of physicians. MATERIALS AND METHODS A total of 103 patient queries and corresponding physician answers were extracted from EXABO, a question-answering forum dedicated to rare respiratory diseases. The responses provided by physicians and generated by LLMs were ranked on a Likert scale by a panel of 4 experts based on 4 key quality criteria for health communication: correctness, comprehensibility, relevance, and empathy. RESULTS The performance of generative pretrained transformer 4 (GPT-4) was significantly better than that of the physicians and BioMistral 7B. While the overall ranking considers GPT-4's responses to be mostly correct, comprehensive, relevant, and empathetic, the responses provided by BioMistral 7B were only partially correct and empathetic. The responses given by physicians rank in between. Although the experts concur that an LLM could lighten the load for physicians, rigorous validation is considered essential to guarantee dependability and efficacy. DISCUSSION Open-source models such as BioMistral 7B offer the advantage of privacy by running locally in health-care settings. GPT-4, on the other hand, demonstrates proficiency in communication and knowledge depth. However, challenges persist, including the management of response variability, the balancing of comprehensibility with medical accuracy, and the assurance of consistent performance across different languages. CONCLUSION The performance of GPT-4 underscores the potential of LLMs in facilitating physician-patient communication. However, it is imperative that these systems are handled with care, as erroneous responses have the potential to cause harm without the requisite validation procedures.
Affiliation(s)
- Magdalena T Weber
  - Institute of Medical Informatics, University Medicine Frankfurt, Goethe University Frankfurt, Frankfurt 60590, Germany
- Richard Noll
  - Institute of Medical Informatics, University Medicine Frankfurt, Goethe University Frankfurt, Frankfurt 60590, Germany
- Alexandra Marchl
  - Institute of Medical Informatics, University Medicine Frankfurt, Goethe University Frankfurt, Frankfurt 60590, Germany
- Achim Grünewaldt
  - Department of Respiratory Medicine and Allergology, University Medicine Frankfurt, Goethe University Frankfurt, Frankfurt 60590, Germany
- Christian Hügel
  - HELIOS Dr Horst Schmidt Kliniken Wiesbaden, Klinik für Pneumologie, Wiesbaden 65199, Germany
- Khader Musleh
  - Department of Respiratory Medicine and Allergology, University Medicine Frankfurt, Goethe University Frankfurt, Frankfurt 60590, Germany
- Thomas O F Wagner
  - European Reference Network for Rare Respiratory Diseases (ERN-LUNG), University Medicine Frankfurt, Frankfurt 60590, Germany
- Holger Storf
  - Institute of Medical Informatics, University Medicine Frankfurt, Goethe University Frankfurt, Frankfurt 60590, Germany
- Jannik Schaaf
  - Institute of Medical Informatics, University Medicine Frankfurt, Goethe University Frankfurt, Frankfurt 60590, Germany
9
Jabal MS, Warman P, Zhang J, Gupta K, Jain A, Mazurowski M, Wiggins W, Magudia K, Calabrese E. Open-Weight Language Models and Retrieval-Augmented Generation for Automated Structured Data Extraction from Diagnostic Reports: Assessment of Approaches and Parameters. Radiol Artif Intell 2025;7:e240551. [PMID: 40072216] [DOI: 10.1148/ryai.240551]
Abstract
Purpose To develop and evaluate an automated system for extracting structured clinical information from unstructured radiology and pathology reports using open-weight language models (LMs) and retrieval-augmented generation (RAG) and to assess the effects of model configuration variables on extraction performance. Materials and Methods This retrospective study used two datasets: 7294 radiology reports annotated for Brain Tumor Reporting and Data System (BT-RADS) scores and 2154 pathology reports annotated for IDH mutation status (January 2017-July 2021). An automated pipeline was developed to benchmark the performance of various LMs and RAG configurations for accuracy of structured data extraction from reports. The effect of model size, quantization, prompting strategies, output formatting, and inference parameters on model accuracy was systematically evaluated. Results The best-performing models achieved up to 98% accuracy in extracting BT-RADS scores from radiology reports and greater than 90% accuracy for extraction of IDH mutation status from pathology reports. The best-performing model was a medically fine-tuned Llama 3. Larger, newer, and domain fine-tuned models consistently outperformed older and smaller models (mean accuracy, 86% vs 75%; P < .001). Model quantization had minimal effect on performance. Few-shot prompting significantly improved accuracy (mean [±SD] increase, 32% ± 32; P = .02). RAG improved performance for complex pathology reports by a mean of 48% ± 11 (P = .001) but not for shorter radiology reports (-8% ± 31; P = .39). Conclusion This study demonstrates the potential of open LMs for automated extraction of structured clinical data from unstructured clinical reports in local, privacy-preserving applications. Careful model selection, prompt engineering, and semiautomated optimization using annotated data are critical for optimal performance. Keywords: Large Language Models, Retrieval-Augmented Generation, Radiology, Pathology, Health Care Reports
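As a rough sketch of the kind of few-shot, retrieval-augmented prompting the study benchmarks, the snippet below retrieves the most similar annotated reports as exemplars and assembles an extraction prompt; the embedding model name and the llm_generate() call are placeholders, not the authors' pipeline:

```python
# Illustrative sketch only; "all-MiniLM-L6-v2" is an assumed embedding model and
# llm_generate() is a placeholder for whichever local open-weight LM answers the prompt.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

annotated = [  # previously annotated reports act as few-shot exemplars
    {"report": "... stable enhancement, no new lesions ...", "score": "2"},
    {"report": "... new enhancing nodule along the resection cavity ...", "score": "3b"},
    {"report": "... marked decrease in enhancing tumour volume ...", "score": "1b"},
]
exemplar_vecs = embedder.encode([ex["report"] for ex in annotated], normalize_embeddings=True)

def build_prompt(new_report: str, k: int = 2) -> str:
    query_vec = embedder.encode([new_report], normalize_embeddings=True)[0]
    top = np.argsort(exemplar_vecs @ query_vec)[::-1][:k]  # cosine similarity (vectors are normalized)
    shots = "\n\n".join(
        f"Report: {annotated[i]['report']}\nBT-RADS: {annotated[i]['score']}" for i in top
    )
    return ("Extract the BT-RADS score from the final report. Answer with the score only.\n\n"
            f"{shots}\n\nReport: {new_report}\nBT-RADS:")

# score = llm_generate(build_prompt(unseen_report))  # placeholder call to the local LM
```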
Affiliation(s)
- Mohamed Sobhi Jabal
  - Department of Radiology, Duke University Hospital, 2301 Erwin Rd, Durham, NC 27710
- Jikai Zhang
  - Department of Electrical and Computer Engineering, Duke University, Durham, NC
  - Duke Center for Artificial Intelligence in Radiology, Duke University, Durham, NC
- Kartikeye Gupta
  - Department of Radiology, Duke University Medical Center, Durham, NC
- Ayush Jain
  - Department of Radiology, Duke University Medical Center, Durham, NC
- Maciej Mazurowski
  - Department of Radiology, Duke University Hospital, 2301 Erwin Rd, Durham, NC 27710
  - Duke University School of Medicine, Durham, NC
  - Department of Electrical and Computer Engineering, Duke University, Durham, NC
- Walter Wiggins
  - Department of Radiology, Duke University Hospital, 2301 Erwin Rd, Durham, NC 27710
- Kirti Magudia
  - Department of Radiology, Duke University Hospital, 2301 Erwin Rd, Durham, NC 27710
- Evan Calabrese
  - Department of Radiology, Duke University Hospital, 2301 Erwin Rd, Durham, NC 27710
  - Department of Radiology, Duke University Medical Center, Durham, NC
10
Neveditsin N, Lingras P, Mago V. Clinical insights: A comprehensive review of language models in medicine. PLOS Digit Health 2025;4:e0000800. [PMID: 40338967] [PMCID: PMC12061104] [DOI: 10.1371/journal.pdig.0000800]
Abstract
This paper explores the advancements and applications of language models in healthcare, focusing on their clinical use cases. It examines the evolution from early encoder-based systems requiring extensive fine-tuning to state-of-the-art large language and multimodal models capable of integrating text and visual data through in-context learning. The analysis emphasizes locally deployable models, which enhance data privacy and operational autonomy, and their applications in tasks such as text generation, classification, information extraction, and conversational systems. The paper also highlights a structured organization of tasks and a tiered ethical approach, providing a valuable resource for researchers and practitioners, while discussing key challenges related to ethics, evaluation, and implementation.
Affiliation(s)
- Nikita Neveditsin
  - Department of Mathematics and Computing Science, Saint Mary’s University, Halifax, Nova Scotia, Canada
- Pawan Lingras
  - Department of Mathematics and Computing Science, Saint Mary’s University, Halifax, Nova Scotia, Canada
- Vijay Mago
  - School of Health Policy and Management, York University, Toronto, Ontario, Canada
11
Gallano G, Giglio A, Ferre A. Artificial Intelligence in Speech-Language Pathology and Dysphagia: A Review From Latin American Perspective and Pilot Test of LLMs for Rehabilitation Planning. J Voice 2025:S0892-1997(25)00158-4. [PMID: 40312192] [DOI: 10.1016/j.jvoice.2025.04.010]
Abstract
Artificial Intelligence (AI) is transforming speech-language pathology (SLP) and dysphagia management, offering innovative solutions for assessment, diagnosis, and rehabilitation. This narrative review examines AI applications in these fields from 2014 to 2024, with particular focus on implementation challenges in Latin America. We analyze key AI technologies (deep learning, machine learning algorithms, and natural language processing) that have demonstrated high accuracy in detecting voice disorders, analyzing swallowing function, and supporting personalized rehabilitation. The review identifies three primary domains of AI application: diagnostic tools with improved sensitivity for speech-language disorders, rehabilitation technologies that enable customized therapy, and telehealth platforms that expand access to specialized care in underserved regions. However, significant barriers persist, particularly in Latin America, where limited infrastructure, insufficient linguistic adaptation, and scarce regional datasets hamper widespread implementation. Our pilot study evaluating commercially available large language models for rehabilitation planning demonstrates their potential utility in generating structured therapy activities, especially in resource-constrained settings. While AI shows promise in enhancing clinical workflows and expanding service delivery, the evidence suggests that current applications remain predominantly focused on diagnosis rather than integrated rehabilitation. This review highlights the need for culturally and linguistically adapted AI models, expanded regional research collaborations, and regulatory frameworks that ensure ethical AI integration into SLP and dysphagia care, positioning these technologies as complementary tools that enhance rather than replace clinical expertise.
Affiliation(s)
- Andres Giglio
  - Critical Care Department, Clinica Las Condes Hospital, Santiago, Chile
  - Critical Care Department, Finis Terrae University, Santiago, Chile
- Andres Ferre
  - Critical Care Department, Clinica Las Condes Hospital, Santiago, Chile
  - Critical Care Department, Finis Terrae University, Santiago, Chile
12
Shan G, Chen X, Wang C, Liu L, Gu Y, Jiang H, Shi T. Comparing Diagnostic Accuracy of Clinical Professionals and Large Language Models: Systematic Review and Meta-Analysis. JMIR Med Inform 2025;13:e64963. [PMID: 40279517] [PMCID: PMC12047852] [DOI: 10.2196/64963]
Abstract
Background With the rapid development of artificial intelligence (AI) technology, especially generative AI, large language models (LLMs) have shown great potential in the medical field. Trained on massive volumes of medical data, they can understand complex medical texts, quickly analyze medical records, and directly provide health counseling and diagnostic advice, especially for rare diseases. However, no study has yet compared and extensively discussed the diagnostic performance of LLMs with that of physicians. Objective This study systematically reviewed the accuracy of LLMs in clinical diagnosis and provided a reference for further clinical application. Methods We conducted searches in CNKI (China National Knowledge Infrastructure), VIP Database, SinoMed, PubMed, Web of Science, Embase, and CINAHL (Cumulative Index to Nursing and Allied Health Literature) from January 1, 2017, to the present. Two reviewers independently screened the literature and extracted relevant information. The risk of bias was assessed using the Prediction Model Risk of Bias Assessment Tool (PROBAST), which evaluates both the risk of bias and the applicability of included studies. Results A total of 30 studies involving 19 LLMs and 4762 cases were included. The quality assessment indicated a high risk of bias in the majority of studies, primarily because the included cases had already known diagnoses. For the best-performing models, the accuracy of the primary diagnosis ranged from 25% to 97.8%, while triage accuracy ranged from 66.5% to 98%. Conclusions LLMs have demonstrated considerable diagnostic capabilities and significant potential for application across various clinical cases. Although their accuracy still falls short of that of clinical professionals, if used cautiously, they have the potential to become one of the best intelligent assistants in the field of human health care.
Affiliation(s)
- Guxue Shan
  - Nanjing Drum Tower Hospital Clinical College of Nanjing University of Chinese Medicine, Nanjing, China
- Xiaonan Chen
  - Nanjing Drum Tower Hospital Clinical College of Nanjing University of Chinese Medicine, Nanjing, China
- Chen Wang
  - Nanjing Drum Tower Hospital Clinical College of Nanjing University of Chinese Medicine, Nanjing, China
- Li Liu
  - Jiangsu Province Hospital of Chinese Medicine, Affiliated Hospital of Nanjing University of Chinese Medicine, Nanjing, China
- Yuanjing Gu
  - Department of Emergency, Nanjing Drum Tower Hospital, Nanjing, China
- Huiping Jiang
  - Department of Nursing, Nanjing Drum Tower Hospital, Nanjing, China
- Tingqi Shi
  - Department of Quality Management, Nanjing Drum Tower Hospital, Affiliated Hospital of Medical School, Nanjing University, 321 Zhongshan Road, Gulou District, Nanjing, 210008, China
13
Kim JK, Chua ME, Li TG, Rickard M, Lorenzo AJ. Novel AI applications in systematic review: GPT-4 assisted data extraction, analysis, review of bias. BMJ Evid Based Med 2025:bmjebm-2024-113066. [PMID: 40199559] [DOI: 10.1136/bmjebm-2024-113066]
Abstract
OBJECTIVE To assess custom GPT-4 performance in extracting and evaluating data from medical literature to assist in the systematic review (SR) process. DESIGN A proof-of-concept comparative study was conducted to assess the accuracy and precision of custom GPT-4 models against human-performed reviews of randomised controlled trials (RCTs). SETTING Four custom GPT-4 models were developed, each specialising in one of the following areas: (1) extraction of study characteristics, (2) extraction of outcomes, (3) extraction of bias assessment domains and (4) evaluation of risk of bias using results from the third GPT-4 model. Model outputs were compared against data from four SRs conducted by human authors. The evaluation focused on accuracy in data extraction, precision in replicating outcomes and agreement levels in risk of bias assessments. PARTICIPANTS Among the four SRs chosen, 43 studies were retrieved for data extraction evaluation. Additionally, 17 RCTs were selected for comparison of risk of bias assessments, where both human comparator SRs and an analogous SR provided assessments for comparison. INTERVENTION Custom GPT-4 models were deployed to extract data and evaluate risk of bias from selected studies, and their outputs were compared to those generated by human reviewers. MAIN OUTCOME MEASURES Concordance rates between GPT-4 outputs and human-performed SRs in data extraction, effect size comparability and inter/intra-rater agreement in risk of bias assessments. RESULTS When comparing the automatically extracted data to the first table of study characteristics from the published review, GPT-4 showed 88.6% concordance with the original review, with <5% discrepancies due to inaccuracies or omissions. It exceeded human accuracy in 2.5% of instances. Study outcomes were extracted, and pooling of results showed effect sizes comparable to those of the comparator SRs. Risk of bias assessment using GPT-4 showed fair-to-moderate but significant intra-rater agreement (ICC=0.518, p<0.001) and inter-rater agreement with the human comparator SR (weighted kappa=0.237) and the analogous SR (weighted kappa=0.296). In contrast, there was poor agreement between the two human-performed SRs (weighted kappa=0.094). CONCLUSION Custom GPT-4 models perform well in extracting precise data from medical literature, with potential for use in risk of bias assessment. While the evaluated tasks are simpler than the broader range of SR methodologies, they provide an important initial assessment of GPT-4's capabilities.
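The agreement statistics reported above (weighted kappa, ICC) can be computed with standard libraries; a minimal sketch on made-up risk-of-bias ratings, not the study's data:

```python
# Toy example of chance-corrected agreement between two raters on ordinal risk-of-bias ratings.
# Ratings are invented for illustration; 0 = low risk, 1 = some concerns, 2 = high risk.
from sklearn.metrics import cohen_kappa_score

gpt4_ratings  = [0, 1, 2, 1, 0, 2, 1, 1, 0, 2]
human_ratings = [0, 1, 1, 1, 0, 2, 2, 1, 0, 2]

# Linear weights penalise adjacent-category disagreements less than extreme ones.
kappa = cohen_kappa_score(gpt4_ratings, human_ratings, weights="linear")
print(f"weighted kappa = {kappa:.3f}")
```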
Affiliation(s)
- Jin Kyu Kim
  - Department of Surgery, The Hospital for Sick Children, Toronto, Ontario, Canada
  - Department of Surgery, University of Toronto, Toronto, Ontario, Canada
  - Urology, Riley Hospital for Children, Indianapolis, Indiana, USA
- Michael Erlano Chua
  - Department of Surgery, The Hospital for Sick Children, Toronto, Ontario, Canada
  - Department of Surgery, University of Toronto, Toronto, Ontario, Canada
- Tian Ge Li
  - Department of Surgery, University of Toronto, Toronto, Ontario, Canada
- Mandy Rickard
  - Department of Surgery, The Hospital for Sick Children, Toronto, Ontario, Canada
- Armando J Lorenzo
  - Department of Surgery, The Hospital for Sick Children, Toronto, Ontario, Canada
  - Department of Surgery, University of Toronto, Toronto, Ontario, Canada
14
Resch B, Kolokoussis P, Hanny D, Brovelli MA, Kamel Boulos MN. The generative revolution: AI foundation models in geospatial health-applications, challenges and future research. Int J Health Geogr 2025;24:6. [PMID: 40176078] [PMCID: PMC11966900] [DOI: 10.1186/s12942-025-00391-0]
Abstract
In an era of rapid technological advancements, generative artificial intelligence and foundation models are reshaping industries and offering new advanced solutions in a wide range of scientific areas, particularly in public and environmental health. However, foundation models have previously mostly focused on understanding and generating text, while geospatial features, interrelations, flows and correlations have been neglected. Thus, this paper outlines the importance of research into Geospatial Foundation Models, which have the potential to revolutionise digital health surveillance and public health. We examine the latest advances, opportunities, challenges, and ethical considerations of geospatial foundation models for research and applications in digital health. We focus on the specific challenges of integrating geospatial context with foundation models and lay out the future potential for multimodal geospatial foundation models for a variety of research avenues in digital health surveillance and health assessment.
Affiliation(s)
- Bernd Resch
  - IT:U Interdisciplinary Transformation University, 4040, Linz, Austria
  - Center for Geographic Analysis, Harvard University, Cambridge, MA, 02138, USA
- Polychronis Kolokoussis
  - School of Rural, Surveying & Geoinformatics Engineering, National Technical University of Athens, 15780, Athens, Greece
- David Hanny
  - IT:U Interdisciplinary Transformation University, 4040, Linz, Austria
- Maria Antonia Brovelli
  - Department of Civil and Environmental Engineering, Politecnico Di Milano, 20133, Milan, Italy
15
Jung KH. Large Language Models in Medicine: Clinical Applications, Technical Challenges, and Ethical Considerations. Healthc Inform Res 2025;31:114-124. [PMID: 40384063] [PMCID: PMC12086438] [DOI: 10.4258/hir.2025.31.2.114]
Abstract
OBJECTIVES This study presents a comprehensive review of the clinical applications, technical challenges, and ethical considerations associated with using large language models (LLMs) in medicine. METHODS A literature survey of peer-reviewed articles, technical reports, and expert commentary from relevant medical and artificial intelligence journals was conducted. Key clinical application areas, technical limitations (e.g., accuracy, validation, transparency), and ethical issues (e.g., bias, safety, accountability, privacy) were identified and analyzed. RESULTS LLMs have potential in clinical documentation assistance, decision support, patient communication, and workflow optimization. The level of supporting evidence varies; documentation support applications are relatively mature, whereas autonomous diagnostics continue to face notable limitations regarding accuracy and validation. Key technical challenges include model hallucination, lack of robust clinical validation, integration issues, and limited transparency. Ethical concerns involve algorithmic bias risking health inequities, threats to patient safety from inaccuracies, unclear accountability, data privacy, and impacts on clinician-patient interactions. CONCLUSIONS LLMs possess transformative potential for clinical medicine, particularly by augmenting clinician capabilities. However, substantial technical and ethical hurdles necessitate rigorous research, validation, clearly defined guidelines, and human oversight. Existing evidence supports an assistive rather than autonomous role, mandating careful, evidence-based integration that prioritizes patient safety and equity.
Affiliation(s)
- Kyu-Hwan Jung
  - Department of Medical Device Management and Research, Samsung Advanced Institute for Health Sciences and Technology, Sungkyunkwan University, Seoul, Korea
  - Smart Healthcare Research Institute, Research Institute for Future Medicine, Samsung Medical Center, Seoul, Korea
16
Umesh C, Mahendra M, Bej S, Wolkenhauer O, Wolfien M. Challenges and applications in generative AI for clinical tabular data in physiology. Pflugers Arch 2025;477:531-542. [PMID: 39417878] [PMCID: PMC11958401] [DOI: 10.1007/s00424-024-03024-w]
Abstract
Recent advancements in generative approaches in AI have opened up the prospect of synthetic tabular clinical data generation. From filling in missing values in real-world data, these approaches have now advanced to creating complex multi-table datasets. This review explores the development of techniques capable of synthesizing patient data and modeling multiple tables, and highlights the challenges and opportunities of these methods for analyzing patient data in physiology. We also discuss the challenges and potential of these approaches in improving clinical research, personalized medicine, and healthcare policy. The integration of these generative models into physiological settings may represent both a theoretical advancement and a practical tool that has the potential to improve mechanistic understanding and patient care. By providing a reliable source of synthetic data, these models can also help mitigate privacy concerns and facilitate large-scale data sharing.
Affiliation(s)
- Chaithra Umesh
  - Institute of Computer Science, Department of Systems Biology and Bioinformatics, University of Rostock, Rostock, Germany
- Manjunath Mahendra
  - Institute of Computer Science, Department of Systems Biology and Bioinformatics, University of Rostock, Rostock, Germany
- Saptarshi Bej
  - School of Data Science, Indian Institute of Science Education and Research (IISER), Thiruvananthapuram, India
- Olaf Wolkenhauer
  - Institute of Computer Science, Department of Systems Biology and Bioinformatics, University of Rostock, Rostock, Germany
  - Leibniz-Institute for Food Systems Biology, Technical University of Munich, Freising, Germany
- Markus Wolfien
  - Faculty of Medicine Carl Gustav Carus, Institute for Medical Informatics and Biometry, TUD Dresden University of Technology, Dresden, Germany
  - Center for Scalable Data Analytics and Artificial Intelligence (ScaDS.AI), Dresden, Germany
17
Shoorgashti R, Alimohammadi M, Baghizadeh S, Radmard B, Ebrahimi H, Lesan S. Artificial Intelligence Models Accuracy for Odontogenic Keratocyst Detection From Panoramic View Radiographs: A Systematic Review and Meta-Analysis. Health Sci Rep 2025;8:e70614. [PMID: 40165928] [PMCID: PMC11956212] [DOI: 10.1002/hsr2.70614]
Abstract
Background and Aims Odontogenic keratocyst (OKC) is a radiolucent jaw lesion often mistaken for similar conditions like ameloblastomas on panoramic radiographs. Accurate diagnosis is vital for effective management, but manual image interpretation can be inconsistent. While deep learning algorithms in AI have shown promise in improving diagnostic accuracy for OKCs, their performance across studies is still unclear. This systematic review and meta-analysis aimed to evaluate the diagnostic accuracy of AI models in detecting OKC from panoramic radiographs. Methods A systematic search was performed across 5 databases. Studies were included if they examined the PICO question of whether AI models (I) could improve the diagnostic accuracy (O) of OKC in panoramic radiographs (P) compared to reference standards (C). Key performance metrics including sensitivity, specificity, accuracy, and area under the curve (AUC) were extracted and pooled using random-effects models. Meta-regression and subgroup analyses were conducted to identify sources of heterogeneity. Publication bias was evaluated through funnel plots and Egger's test. Results Eight studies were included in the meta-analysis. The pooled sensitivity across all studies was 83.66% (95% CI:73.75%-93.57%) and specificity was 82.89% (95% CI:70.31%-95.47%). YOLO-based models demonstrated superior diagnostic performance with a sensitivity of 96.4% and specificity of 96.0%, compared to other architectures. Meta-regression analysis indicated that model architecture was a significant predictor of diagnostic performance, accounting for a significant portion of the observed heterogeneity. However, the analysis also revealed publication bias and high variability across studies (Egger's test, p = 0.042). Conclusion AI models, particularly YOLO-based architectures, can improve the diagnostic accuracy of OKCs in panoramic radiographs. While AI shows strong capabilities in simple cases, it should complement, not replace, human expertise, especially in complex situations.
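The pooled sensitivity and specificity above come from random-effects meta-analysis; a simplified DerSimonian-Laird pooling of logit-transformed sensitivities on toy counts (not the review's data, and simpler than a full bivariate model) illustrates the computation:

```python
# Toy DerSimonian-Laird random-effects pooling of study-level sensitivities on the logit scale.
import numpy as np

tp = np.array([45, 80, 30, 60])   # true positives per study (hypothetical counts)
fn = np.array([10, 12,  8, 15])   # false negatives per study (hypothetical counts)

p = tp / (tp + fn)                 # study sensitivities
y = np.log(p / (1 - p))            # logit transform
v = 1 / tp + 1 / fn                # approximate within-study variance of the logit

w = 1 / v
y_fixed = np.sum(w * y) / np.sum(w)
q = np.sum(w * (y - y_fixed) ** 2)                                            # Cochran's Q
tau2 = max(0.0, (q - (len(y) - 1)) / (np.sum(w) - np.sum(w**2) / np.sum(w)))  # between-study variance

w_star = 1 / (v + tau2)            # random-effects weights
pooled_logit = np.sum(w_star * y) / np.sum(w_star)
print(f"pooled sensitivity = {1 / (1 + np.exp(-pooled_logit)):.3f}")
```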
Affiliation(s)
- Reyhaneh Shoorgashti
- Department of Oral and Maxillofacial Medicine, School of Dentistry, Islamic Azad University of Medical Sciences, Tehran, Iran
| | | | - Sana Baghizadeh
- Faculty of Dentistry, Tehran Medical Sciences, Islamic Azad University, Tehran, Iran
| | - Bahareh Radmard
- School of Dentistry, Shahid Beheshti University of Medical Sciences, Tehran, Iran
| | - Hooman Ebrahimi
- Department of Oral and Maxillofacial Medicine, School of Dentistry, Islamic Azad University of Medical Sciences, Tehran, Iran
| | - Simin Lesan
- Department of Oral and Maxillofacial Medicine, School of Dentistry, Islamic Azad University of Medical Sciences, Tehran, Iran
|
18
|
Lu J, Choi K, Eremeev M, Gobburu J, Goswami S, Liu Q, Mo G, Musante CJ, Shahin MH. Large Language Models and Their Applications in Drug Discovery and Development: A Primer. Clin Transl Sci 2025; 18:e70205. [PMID: 40208836 PMCID: PMC11984503 DOI: 10.1111/cts.70205] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/06/2024] [Revised: 02/21/2025] [Accepted: 03/10/2025] [Indexed: 04/12/2025] Open
Abstract
Large language models (LLMs) have emerged as powerful tools in many fields, including clinical pharmacology and translational medicine. This paper aims to provide a comprehensive primer on the applications of LLMs to these disciplines. We will explore the fundamental concepts of LLMs, their potential applications in drug discovery and development processes ranging from facilitating target identification to aiding preclinical research and clinical trial analysis, and practical use cases such as assisting with medical writing and accelerating analytical workflows in quantitative clinical pharmacology. By the end of this paper, clinical pharmacologists and translational scientists will have a clearer understanding of how to leverage LLMs to enhance their research and development efforts.
Affiliation(s)
- James Lu
- Clinical Pharmacology, Genentech Inc., South San Francisco, California, USA
| | - Keunwoo Choi
- Prescient Design, Genentech Inc., South San Francisco, California, USA
| | - Maksim Eremeev
- Prescient Design, Genentech Inc., South San Francisco, California, USA
| | - Jogarao Gobburu
- University of Maryland School of Pharmacy, Baltimore, Maryland, USA
| | | | - Qi Liu
- Office of Clinical Pharmacology, Center for Drug Evaluation and Research, U.S. FDA, Silver Spring, Maryland, USA
| | - Gary Mo
- Pfizer Research & Development; currently at Eli Lilly and Company, Indianapolis, Indiana, USA
|
19
|
Guo L, Zuo Y, Yisha Z, Liu J, Gu A, Yushan R, Liu G, Li S, Liu T, Wang X. Diagnostic performance of advanced large language models in cystoscopy: evidence from a retrospective study and clinical cases. BMC Urol 2025; 25:64. [PMID: 40158093 PMCID: PMC11954320 DOI: 10.1186/s12894-025-01740-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/19/2024] [Accepted: 03/11/2025] [Indexed: 04/01/2025] Open
Abstract
PURPOSE To evaluate the diagnostic capabilities of advanced large language models (LLMs) in interpreting cystoscopy images for the identification of common urological conditions. MATERIALS AND METHODS A retrospective analysis was conducted on 603 cystoscopy images obtained from 101 procedures. Two advanced LLMs, both at the forefront of artificial intelligence technology, were employed to interpret these images. The diagnostic interpretations generated by these LLMs were systematically compared against standard clinical diagnostic assessments. The study's primary outcome measure was the overall diagnostic accuracy of the LLMs. Secondary outcomes focused on evaluating condition-specific accuracies across various urological conditions. RESULTS The combined diagnostic accuracy of both LLMs was 89.2%, with ChatGPT-4V and Claude 3.5 Sonnet achieving accuracies of 82.8% and 79.8%, respectively. Condition-specific accuracies varied considerably. For specific urological disorders: bladder tumors (ChatGPT-4V: 92.2%, Claude 3.5 Sonnet: 80.9%), BPH (35.3%, 32.4%), cystitis (94.5%, 98.9%), bladder diverticula (92.3%, 53.8%), and bladder trabeculae (55.8%, 59.6%). For normal anatomical structures: ureteral orifice (ChatGPT-4V: 48.8%, Claude 3.5 Sonnet: 61.0%), bladder neck (97.9%, 93.8%), and prostatic urethra (64.3%, 57.1%). CONCLUSIONS Advanced language models demonstrated varying levels of diagnostic accuracy in cystoscopy image interpretation, excelling in cystitis detection while showing lower accuracy for other conditions, notably benign prostatic hyperplasia. These findings suggest promising potential for LLMs as supportive tools in urological diagnosis, particularly for urologists in training or early career stages. This study underscores the need for continued research and development to optimize these AI-driven tools, with the ultimate goal of improving diagnostic accuracy and efficiency in urological practice. CLINICAL TRIAL NUMBER Not applicable.
Affiliation(s)
- Linfa Guo
- Department of Urology, Zhongnan Hospital of Wuhan University, Wuhan, China
| | - Yingtong Zuo
- Department of Urology, Zhongnan Hospital of Wuhan University, Wuhan, China
| | - Zuhaer Yisha
- Department of Epidemiology and Biostatistics, School of Public Health, Peking University, Beijing, China
| | - Jiuling Liu
- Department of Urology, Zhongnan Hospital of Wuhan University, Wuhan, China
| | - Aodun Gu
- Department of Urology, Zhongnan Hospital of Wuhan University, Wuhan, China
| | - Refate Yushan
- Department of Urology, Zhongnan Hospital of Wuhan University, Wuhan, China
| | - Guiyong Liu
- Department of Urology, Qianjiang Central Hospital of Hubei Province, Qianjiang, China
| | - Sheng Li
- Department of Urology, Zhongnan Hospital of Wuhan University, Wuhan, China
- Hubei Key Laboratory of Urological Diseases, Wuhan University, Wuhan, China
- Hubei Medical Quality Control Center for Laparoscopic/Endoscopic Urologic Surgery, Zhongnan Hospital of Wuhan University, Wuhan, China
- Wuhan Clinical Research Center for Urogenital Tumors, Zhongnan Hospital of Wuhan University, Wuhan, China
- Cancer Precision Diagnosis and Treatment and Translational Medicine Hubei Engineering Research Center, Zhongnan Hospital of Wuhan University, Wuhan, China
| | - Tongzu Liu
- Department of Urology, Zhongnan Hospital of Wuhan University, Wuhan, China.
- Hubei Key Laboratory of Urological Diseases, Wuhan University, Wuhan, China.
- Hubei Clinical Research Center for Laparoscopic/Endoscopic Urologic Surgery, Zhongnan Hospital of Wuhan University, Wuhan, China.
- Institute of Urology, Wuhan University, Wuhan, China.
- Hubei Medical Quality Control Center for Laparoscopic/Endoscopic Urologic Surgery, Zhongnan Hospital of Wuhan University, Wuhan, China.
- Wuhan Clinical Research Center for Urogenital Tumors, Zhongnan Hospital of Wuhan University, Wuhan, China.
- Cancer Precision Diagnosis and Treatment and Translational Medicine Hubei Engineering Research Center, Zhongnan Hospital of Wuhan University, Wuhan, China.
| | - Xiaolong Wang
- Department of Urology, Zhongnan Hospital of Wuhan University, Wuhan, China.
- Hubei Key Laboratory of Urological Diseases, Wuhan University, Wuhan, China.
- Institute of Urology, Wuhan University, Wuhan, China.
|
20
|
Mondal H. Integration of Large Language Models as an Adjunct Tool in Healthcare. Turk Arch Otorhinolaryngol 2025; 62:174-175. [PMID: 40152484 PMCID: PMC11977009 DOI: 10.4274/tao.2024.2024-10-11] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/25/2024] [Accepted: 11/26/2024] [Indexed: 03/29/2025] Open
Affiliation(s)
- Himel Mondal
- All India Institute of Medical Sciences Department of Physiology, Jharkhand, India
|
21
|
Menz BD, Modi ND, Abuhelwa AY, Ruanglertboon W, Vitry A, Gao Y, Li LX, Chhetri R, Chu B, Bacchi S, Kichenadasse G, Shahnam A, Rowland A, Sorich MJ, Hopkins AM. Generative AI chatbots for reliable cancer information: Evaluating web-search, multilingual, and reference capabilities of emerging large language models. Eur J Cancer 2025; 218:115274. [PMID: 39922126 DOI: 10.1016/j.ejca.2025.115274] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/05/2025] [Revised: 01/22/2025] [Accepted: 01/24/2025] [Indexed: 02/10/2025]
Abstract
Recent advancements in large language models (LLMs) enable real-time web search, improved referencing, and multilingual support, yet ensuring they provide safe health information remains crucial. This perspective evaluates seven publicly accessible LLMs (ChatGPT, Co-Pilot, Gemini, MetaAI, Claude, Grok, and Perplexity) on three simple cancer-related queries across eight languages (336 responses: English, French, Chinese, Thai, Hindi, Nepali, Vietnamese, and Arabic). None of the 42 English responses contained clinically meaningful hallucinations, whereas 7 of 294 non-English responses did. Overall, 48% (162/336) of responses included valid references, but 39% of the English references were .com links, reflecting quality concerns. English responses frequently exceeded an eighth-grade reading level, and many non-English outputs were also complex. These findings reflect substantial progress over the past 2 years but reveal persistent gaps in multilingual accuracy, reliable reference inclusion, referral practices, and readability. Ongoing benchmarking is essential to ensure LLMs safely support global health information delivery and meet online information standards.
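The readability finding above can be checked mechanically. Below is a rough sketch that scores a chatbot answer with the Flesch-Kincaid grade formula; the syllable counter is a crude heuristic and the sample answer is invented for illustration.

```python
# A rough sketch (invented sample text): score an answer with the
# Flesch-Kincaid grade formula, using a crude vowel-group syllable heuristic.
import re

def count_syllables(word):
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fk_grade(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * (len(words) / len(sentences)) + 11.8 * (syllables / len(words)) - 15.59

answer = ("Chemotherapy uses medicines to destroy cancer cells. "
          "Your care team will explain the likely side effects before treatment starts.")
grade = fk_grade(answer)
print(f"Flesch-Kincaid grade: {grade:.1f}",
      "(above 8th-grade level)" if grade > 8 else "(at or below 8th-grade level)")
```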
Affiliation(s)
- Bradley D Menz
- College of Medicine and Public Health, Flinders Health and Medical Research Institute, Flinders University, Adelaide, Australia
| | - Natansh D Modi
- College of Medicine and Public Health, Flinders Health and Medical Research Institute, Flinders University, Adelaide, Australia
| | - Ahmad Y Abuhelwa
- Department of Pharmacy Practice and Pharmacotherapeutics, College of Pharmacy, University of Sharjah, Sharjah, United Arab Emirates
| | - Warit Ruanglertboon
- Division of Health and Applied Sciences, Prince of Songkla University, Songkhla, Thailand; Research Center in Mathematics and Statistics with Applications, Discipline of Statistics, Division of Computational Science, Faculty of Science, Prince of Songkla University, Songkhla, Thailand
| | - Agnes Vitry
- University of South Australia, Clinical and Health Sciences, Adelaide, Australia
| | - Yuan Gao
- College of Medicine and Public Health, Flinders Health and Medical Research Institute, Flinders University, Adelaide, Australia
| | - Lee X Li
- College of Medicine and Public Health, Flinders Health and Medical Research Institute, Flinders University, Adelaide, Australia
| | - Rakchha Chhetri
- College of Medicine and Public Health, Flinders Health and Medical Research Institute, Flinders University, Adelaide, Australia
| | - Bianca Chu
- College of Medicine and Public Health, Flinders Health and Medical Research Institute, Flinders University, Adelaide, Australia
| | - Stephen Bacchi
- Department of Neurology and the Center for Genomic Medicine, Massachusetts General Hospital and Harvard Medical School, Boston, MA 02138, USA
| | - Ganessan Kichenadasse
- College of Medicine and Public Health, Flinders Health and Medical Research Institute, Flinders University, Adelaide, Australia; Flinders Centre for Innovation in Cancer, Department of Medical Oncology, Flinders Medical Centre, Flinders University, Bedford Park, South Australia, Australia
| | - Adel Shahnam
- Medical Oncology, Peter MacCallum Cancer Centre, Melbourne, Australia
| | - Andrew Rowland
- College of Medicine and Public Health, Flinders Health and Medical Research Institute, Flinders University, Adelaide, Australia
| | - Michael J Sorich
- College of Medicine and Public Health, Flinders Health and Medical Research Institute, Flinders University, Adelaide, Australia
| | - Ashley M Hopkins
- College of Medicine and Public Health, Flinders Health and Medical Research Institute, Flinders University, Adelaide, Australia.
|
22
|
Woo JJ, Yang AJ, Olsen RJ, Hasan SS, Nawabi DH, Nwachukwu BU, Williams RJ, Ramkumar PN. Custom Large Language Models Improve Accuracy: Comparing Retrieval Augmented Generation and Artificial Intelligence Agents to Noncustom Models for Evidence-Based Medicine. Arthroscopy 2025; 41:565-573.e6. [PMID: 39521391 DOI: 10.1016/j.arthro.2024.10.042] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 07/21/2024] [Revised: 10/27/2024] [Accepted: 10/27/2024] [Indexed: 11/16/2024]
Abstract
PURPOSE To show the value of custom methods, namely Retrieval Augmented Generation (RAG)-based Large Language Models (LLMs) and Agentic Augmentation, over standard LLMs in delivering accurate information using an anterior cruciate ligament (ACL) injury case. METHODS A set of 100 questions and answers based on the 2022 AAOS ACL guidelines was curated. Closed-source models (OpenAI's GPT-4/GPT-3.5 and Anthropic's Claude 3) and open-source models (Llama3 8b/70b and Mistral 8×7b) were asked questions in base form and again with the AAOS guidelines embedded into a RAG system. The top-performing models were further augmented with artificial intelligence (AI) agents and reevaluated. Two fellowship-trained surgeons blindly evaluated the accuracy of the responses of each cohort. Recall-Oriented Understudy for Gisting Evaluation (ROUGE) and Metric for Evaluation of Translation with Explicit Ordering (METEOR) scores were calculated to assess semantic similarity in the responses. RESULTS All noncustom LLM models started below 60% accuracy. Applying RAG improved the accuracy of every model by an average of 39.7%. The highest-performing model with just RAG was Meta's open-source Llama3 70b (94%). The highest-performing model with RAG and AI agents was OpenAI's GPT-4 (95%). CONCLUSIONS RAG improved accuracy by an average of 39.7%, with the highest accuracy rate of 94% in the Meta Llama3 70b. Incorporating AI agents into a previously RAG-augmented LLM improved the ChatGPT-4 accuracy rate to 95%. Thus, agentic- and RAG-augmented LLMs can be accurate liaisons of information, supporting our hypothesis. CLINICAL RELEVANCE Despite literature surrounding the use of LLMs in medicine, there has been considerable and appropriate skepticism given the variably accurate response rates. This study establishes the groundwork to identify whether custom modifications to LLMs using RAG and agentic augmentation can better deliver accurate information in orthopaedic care. With this knowledge, online medical information commonly sought from popular LLMs, such as ChatGPT, can be standardized to provide relevant information and better support shared decision making between surgeon and patient.
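As a rough illustration of the retrieval step in a RAG setup like the one described, the sketch below embeds a few placeholder guideline passages with TF-IDF, retrieves the most relevant ones for a question, and prepends them to the prompt. The passages and the call_llm function are stand-ins, not the AAOS guideline text or the models evaluated in the study.

```python
# A minimal sketch of the retrieval step in a RAG pipeline. The passages and
# call_llm() are placeholders, not the AAOS guideline text or the study's models.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

passages = [
    "Placeholder guideline passage about ACL reconstruction timing.",
    "Placeholder guideline passage about nonoperative management of ACL tears.",
    "Placeholder guideline passage about graft choice for ACL reconstruction.",
]

def retrieve(question, k=2):
    # Embed passages and the question with TF-IDF and rank by cosine similarity.
    vectorizer = TfidfVectorizer().fit(passages + [question])
    sims = cosine_similarity(vectorizer.transform([question]),
                             vectorizer.transform(passages))[0]
    return [passages[i] for i in sims.argsort()[::-1][:k]]

def call_llm(prompt):
    # Placeholder for whichever chat model is being evaluated (base or agentic).
    return "model answer goes here"

question = "When is ACL reconstruction recommended over nonoperative care?"
context = "\n".join(retrieve(question))
answer = call_llm(f"Answer using only this guideline context:\n{context}\n\nQuestion: {question}")
print(answer)
```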
Affiliation(s)
- Joshua J Woo
- Brown University/The Warren Alpert School of Brown University, Providence, Rhode Island, U.S.A
| | - Andrew J Yang
- Brown University/The Warren Alpert School of Brown University, Providence, Rhode Island, U.S.A
| | - Reena J Olsen
- Tufts University School of Medicine, Boston, Massachusetts, U.S.A
|
23
|
Giacobbe DR, Marelli C, La Manna B, Padua D, Malva A, Guastavino S, Signori A, Mora S, Rosso N, Campi C, Piana M, Murgia Y, Giacomini M, Bassetti M. Advantages and limitations of large language models for antibiotic prescribing and antimicrobial stewardship. NPJ ANTIMICROBIALS AND RESISTANCE 2025; 3:14. [PMID: 40016394 PMCID: PMC11868396 DOI: 10.1038/s44259-025-00084-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 07/30/2024] [Accepted: 02/06/2025] [Indexed: 03/01/2025]
Abstract
Antibiotic prescribing requires balancing optimal treatment for patients with reducing antimicrobial resistance. There is a lack of standardization in research on using large language models (LLMs) for supporting antibiotic prescribing, necessitating more efforts to identify biases and misinformation in their outputs. Educating future medical professionals on these aspects is crucial for ensuring the proper use of LLMs for supporting antibiotic prescribing, providing a deeper understanding of their strengths and limitations.
Affiliation(s)
- Daniele Roberto Giacobbe
- Department of Health Sciences (DISSAL), University of Genoa, Genoa, Italy.
- UO Clinica Malattie Infettive, IRCCS Ospedale Policlinico San Martino, Genoa, Italy.
| | - Cristina Marelli
- UO Clinica Malattie Infettive, IRCCS Ospedale Policlinico San Martino, Genoa, Italy
| | - Bianca La Manna
- Department of Informatics, Bioengineering, Robotics and System Engineering (DIBRIS), University of Genoa, Genoa, Italy
| | - Donatella Padua
- Departmental Faculty of Medicine, UniCamillus - International University of Health and Medical Science, Rome, Italy
| | - Alberto Malva
- Italian Interdisciplinary Society for Primary Care, Bari, Italy
| | | | - Alessio Signori
- Section of Biostatistics, Department of Health Sciences (DISSAL), University of Genoa, Genoa, Italy
- IRCCS Ospedale Policlinico San Martino, Genoa, Italy
| | - Sara Mora
- UO Information and Communication Technologies, IRCCS Ospedale Policlinico San Martino, Genoa, Italy
| | - Nicola Rosso
- UO Information and Communication Technologies, IRCCS Ospedale Policlinico San Martino, Genoa, Italy
| | - Cristina Campi
- Department of Mathematics (DIMA), University of Genoa, Genoa, Italy
- Life Science Computational Laboratory (LISCOMP), IRCCS Ospedale Policlinico San Martino, Genoa, Italy
| | - Michele Piana
- Department of Mathematics (DIMA), University of Genoa, Genoa, Italy
- Life Science Computational Laboratory (LISCOMP), IRCCS Ospedale Policlinico San Martino, Genoa, Italy
| | - Ylenia Murgia
- Department of Informatics, Bioengineering, Robotics and System Engineering (DIBRIS), University of Genoa, Genoa, Italy
| | - Mauro Giacomini
- Department of Informatics, Bioengineering, Robotics and System Engineering (DIBRIS), University of Genoa, Genoa, Italy
| | - Matteo Bassetti
- Department of Health Sciences (DISSAL), University of Genoa, Genoa, Italy
- UO Clinica Malattie Infettive, IRCCS Ospedale Policlinico San Martino, Genoa, Italy
|
24
|
Yin S, Huang S, Xue P, Xu Z, Lian Z, Ye C, Ma S, Liu M, Hu Y, Lu P, Li C. Generative artificial intelligence (GAI) usage guidelines for scholarly publishing: a cross-sectional study of medical journals. BMC Med 2025; 23:77. [PMID: 39934830 PMCID: PMC11816781 DOI: 10.1186/s12916-025-03899-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 06/15/2024] [Accepted: 01/23/2025] [Indexed: 02/13/2025] Open
Abstract
BACKGROUND Generative artificial intelligence (GAI) has developed rapidly and been increasingly used in scholarly publishing, so it is urgent to examine guidelines for its usage. This cross-sectional study aims to examine the coverage and type of recommendations of GAI usage guidelines among medical journals and how these factors relate to journal characteristics. METHODS From the SCImago Journal Rank (SJR) list for medicine in 2022, we generated two groups of journals: top SJR ranked journals (N = 200) and random sample of non-top SJR ranked journals (N = 140). For each group, we examined the coverage of author and reviewer guidelines across four categories: no guidelines, external guidelines only, own guidelines only, and own and external guidelines. We then calculated the number of recommendations by counting the number of usage recommendations for author and reviewer guidelines separately. Regression models examined the relationship of journal characteristics with the coverage and type of recommendations of GAI usage guidelines. RESULTS A higher proportion of top SJR ranked journals provided author guidelines compared to the random sample of non-top SJR ranked journals (95.0% vs. 86.7%, P < 0.01). The two groups of journals had the same median of 5 on a scale of 0 to 7 for author guidelines and a median of 1 on a scale of 0 to 2 for reviewer guidelines. However, both groups had lower percentages of journals providing recommendations for data analysis and interpretation, with the random sample of non-top SJR ranked journals having a significantly lower percentage (32.5% vs. 16.7%, P < 0.05). A higher SJR score was positively associated with providing GAI usage guidelines for both authors (all P < 0.01) and reviewers (all P < 0.01) among the random sample of non-top SJR ranked journals. CONCLUSIONS Although most medical journals provided their own GAI usage guidelines or referenced external guidelines, some recommendations remained unspecified (e.g., whether AI can be used for data analysis and interpretation). Additionally, journals with lower SJR scores were less likely to provide guidelines, indicating a potential gap that warrants attention. Collaborative efforts are needed to develop specific recommendations that better guide authors and reviewers.
Affiliation(s)
- Shuhui Yin
- Applied Linguistics & Technology, Department of English, Iowa State University, Ames, IA, USA
| | - Simu Huang
- Center for Data Science, Zhejiang University, Hangzhou, Zhejiang, China
| | - Peng Xue
- Institute of Chinese Medical Sciences, University of Macau, Zhuhai, Macao SAR, China
- Centre for Pharmaceutical Regulatory Sciences, University of Macau, Zhuhai, Macao SAR, China
- Faculty of Health Sciences, University of Macau, Zhuhai, Macao SAR, China
| | - Zhuoran Xu
- Graduate Group in Genomics and Computational Biology, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
| | - Zi Lian
- Center for Health Equity & Urban Science Education, Teachers College, Columbia University, New York, NY, USA
| | - Chenfei Ye
- International Research Institute for Artificial Intelligence, Harbin Institute of Technology, Shenzhen, Guangdong, China
| | - Siyuan Ma
- Department of Communication, University of Macau, Zhuhai, Macao SAR, China
| | - Mingxuan Liu
- Department of Communication, University of Macau, Zhuhai, Macao SAR, China
| | - Yuanjia Hu
- Institute of Chinese Medical Sciences, University of Macau, Zhuhai, Macao SAR, China.
- Centre for Pharmaceutical Regulatory Sciences, University of Macau, Zhuhai, Macao SAR, China.
- Faculty of Health Sciences, University of Macau, Zhuhai, Macao SAR, China.
| | - Peiyi Lu
- Department of Social Work and Social Administration, University of Hong Kong, Hong Kong SAR, China.
| | - Chihua Li
- Institute of Chinese Medical Sciences, University of Macau, Zhuhai, Macao SAR, China.
- Centre for Pharmaceutical Regulatory Sciences, University of Macau, Zhuhai, Macao SAR, China.
- Faculty of Health Sciences, University of Macau, Zhuhai, Macao SAR, China.
- Survey Research Center, Institute for Social Research, University of Michigan, Ann Arbor, MI, USA.
|
25
|
Maitin AM, Nogales A, Fernández-Rincón S, Aranguren E, Cervera-Barba E, Denizon-Arranz S, Mateos-Rodríguez A, García-Tejedor ÁJ. Application of large language models in clinical record correction: a comprehensive study on various retraining methods. J Am Med Inform Assoc 2025; 32:341-348. [PMID: 39707579 PMCID: PMC11756697 DOI: 10.1093/jamia/ocae302] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/10/2024] [Revised: 10/23/2024] [Accepted: 11/25/2024] [Indexed: 12/23/2024] Open
Abstract
OBJECTIVES We evaluate the effectiveness of large language models (LLMs), specifically GPT-based (GPT-3.5 and GPT-4) and Llama-2 models (13B and 7B architectures), in autonomously assessing clinical records (CRs) to enhance medical education and diagnostic skills. MATERIALS AND METHODS Various techniques, including prompt engineering, fine-tuning (FT), and low-rank adaptation (LoRA), were implemented and compared on Llama-2 7B. These methods were assessed using prompts in both English and Spanish to determine their adaptability to different languages. Performance was benchmarked against GPT-3.5, GPT-4, and Llama-2 13B. RESULTS GPT-based models, particularly GPT-4, demonstrated promising performance closely aligned with specialist evaluations. Application of FT on Llama-2 7B improved text comprehension in Spanish, equating its performance to that of Llama-2 13B with English prompts. Low-rank adaptation significantly enhanced performance, surpassing GPT-3.5 results when combined with FT. This indicates LoRA's effectiveness in adapting open-source models for specific tasks. DISCUSSION While GPT-4 showed superior performance, FT and LoRA on Llama-2 7B proved crucial in improving language comprehension and task-specific accuracy. Identified limitations highlight the need for further research. CONCLUSION This study underscores the potential of LLMs in medical education, providing an innovative, effective approach to CR correction. Low-rank adaptation emerged as the most effective technique, enabling open-source models to perform on par with proprietary models. Future research should focus on overcoming current limitations to further improve model performance.
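For context, attaching a LoRA adapter to an open-source causal language model typically looks like the sketch below, using the Hugging Face peft library. The model identifier and hyperparameters are illustrative assumptions, not the configuration reported in the study.

```python
# A sketch of attaching a LoRA adapter with Hugging Face peft. The model
# identifier and hyperparameters are illustrative, not the study's configuration.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-2-7b-hf"  # gated checkpoint; requires access approval
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

lora_cfg = LoraConfig(
    r=8,                                  # rank of the low-rank update matrices
    lora_alpha=16,                        # scaling applied to the update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only the small adapter matrices are trainable
# The wrapped model can then be fine-tuned with the usual transformers Trainer
# on (clinical record, corrected record) pairs.
```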
Affiliation(s)
- Ana M Maitin
- CEIEC, Universidad Francisco de Vitoria, Pozuelo de Alarcón, 28223 Madrid, Spain
| | - Alberto Nogales
- CEIEC, Universidad Francisco de Vitoria, Pozuelo de Alarcón, 28223 Madrid, Spain
| | | | - Enrique Aranguren
- CEIEC, Universidad Francisco de Vitoria, Pozuelo de Alarcón, 28223 Madrid, Spain
| | - Emilio Cervera-Barba
- Facultad de Medicina, Universidad Francisco de Vitoria, Pozuelo de Alarcón, 28223 Madrid, Spain
| | - Sophia Denizon-Arranz
- Facultad de Medicina, Universidad Francisco de Vitoria, Pozuelo de Alarcón, 28223 Madrid, Spain
| | - Alonso Mateos-Rodríguez
- Facultad de Medicina, Universidad Francisco de Vitoria, Pozuelo de Alarcón, 28223 Madrid, Spain
|
26
|
Leng Y, Yang Y, Liu J, Jiang J, Zhou C. Evaluating the Feasibility of ChatGPT-4 as a Knowledge Resource in Bariatric Surgery: A Preliminary Assessment. Obes Surg 2025; 35:645-650. [PMID: 39821906 DOI: 10.1007/s11695-024-07666-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/30/2024] [Revised: 12/29/2024] [Accepted: 12/31/2024] [Indexed: 01/19/2025]
Abstract
This study evaluates the feasibility of ChatGPT-4 as a knowledge resource in bariatric surgery. Using a problem set of 30 questions covering key aspects of bariatric care, responses were reviewed by three bariatric surgery experts. ChatGPT-4 achieved strong performance, with 50% of responses scoring the highest possible rating for alignment with clinical guidelines. However, limitations were noted, including outdated criteria, lack of specificity, and occasional poor response structuring. The study highlights the potential of ChatGPT-4 as a supplementary tool for patient education and healthcare provider support, as well as its broader public health applications, such as obesity prevention and healthy lifestyle education. Despite its promise, challenges such as handling complex clinical cases, reliance on up-to-date evidence, and ethical concerns like privacy and misinformation must be addressed. Future research should refine the model's applications and explore its integration into clinical practice and public health strategies.
Affiliation(s)
- Yu Leng
- Department of Anesthesiology, West China Hospital, Sichuan University, Chengdu, China
- Research Center of Anesthesiology, National-Local Joint Engineering Research Centre of Translational Medicine of Anesthesiology, West China Hospital, Sichuan University, Chengdu, China
| | - Yaoxin Yang
- Department of Anesthesiology, West China Hospital, Sichuan University, Chengdu, China
- Research Center of Anesthesiology, National-Local Joint Engineering Research Centre of Translational Medicine of Anesthesiology, West China Hospital, Sichuan University, Chengdu, China
| | - Jin Liu
- Department of Anesthesiology, West China Hospital, Sichuan University, Chengdu, China
- Research Center of Anesthesiology, National-Local Joint Engineering Research Centre of Translational Medicine of Anesthesiology, West China Hospital, Sichuan University, Chengdu, China
| | - Jingyao Jiang
- Department of Anesthesiology, West China Hospital, Sichuan University, Chengdu, China
- Research Center of Anesthesiology, National-Local Joint Engineering Research Centre of Translational Medicine of Anesthesiology, West China Hospital, Sichuan University, Chengdu, China
| | - Cheng Zhou
- Research Center of Anesthesiology, National-Local Joint Engineering Research Centre of Translational Medicine of Anesthesiology, West China Hospital, Sichuan University, Chengdu, China.
|
27
|
Ding L, Fan L, Shen M, Wang Y, Sheng K, Zou Z, An H, Jiang Z. Evaluating ChatGPT's diagnostic potential for pathology images. Front Med (Lausanne) 2025; 11:1507203. [PMID: 39917264 PMCID: PMC11798939 DOI: 10.3389/fmed.2024.1507203] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/07/2024] [Accepted: 12/27/2024] [Indexed: 02/09/2025] Open
Abstract
Background Chat Generative Pretrained Transformer (ChatGPT) is a type of large language model (LLM) developed by OpenAI, known for its extensive knowledge base and interactive capabilities. These attributes make it a valuable tool in the medical field, particularly for tasks such as answering medical questions, drafting clinical notes, and optimizing the generation of radiology reports. However, keeping accuracy in medical contexts is the biggest challenge to employing GPT-4 in a clinical setting. This study aims to investigate the accuracy of GPT-4, which can process both text and image inputs, in generating diagnoses from pathological images. Methods This study analyzed 44 histopathological images from 16 organs and 100 colorectal biopsy photomicrographs. The initial evaluation was conducted using the standard GPT-4 model in January 2024, with a subsequent re-evaluation performed in July 2024. The diagnostic accuracy of GPT-4 was assessed by comparing its outputs to a reference standard using statistical measures. Additionally, four pathologists independently reviewed the same images to compare their diagnoses with the model's outputs. Both scanned and photographed images were tested to evaluate GPT-4's generalization ability across different image types. Results GPT-4 achieved an overall accuracy of 0.64 in identifying tumor imaging and tissue origins. For colon polyp classification, accuracy varied from 0.57 to 0.75 in different subtypes. The model achieved 0.88 accuracy in distinguishing low-grade from high-grade dysplasia and 0.75 in distinguishing high-grade dysplasia from adenocarcinoma, with a high sensitivity in detecting adenocarcinoma. Consistency between initial and follow-up evaluations showed slight to moderate agreement, with Kappa values ranging from 0.204 to 0.375. Conclusion GPT-4 demonstrates the ability to diagnose pathological images, showing improved performance over earlier versions. Its diagnostic accuracy in cancer is comparable to that of pathology residents. These findings suggest that GPT-4 holds promise as a supportive tool in pathology diagnostics, offering the potential to assist pathologists in routine diagnostic workflows.
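The test-retest agreement reported above (Kappa 0.204-0.375) can be computed as in the following sketch, which uses scikit-learn's Cohen's kappa on made-up diagnosis labels rather than the study's data.

```python
# A sketch with invented labels (not the study's data): accuracy against the
# reference diagnoses and Cohen's kappa between the two evaluation rounds.
from sklearn.metrics import accuracy_score, cohen_kappa_score

reference   = ["adenocarcinoma", "low_grade", "high_grade", "hyperplastic", "adenocarcinoma"]
run_january = ["adenocarcinoma", "low_grade", "low_grade",  "hyperplastic", "adenocarcinoma"]
run_july    = ["adenocarcinoma", "high_grade", "low_grade", "hyperplastic", "low_grade"]

print("Accuracy vs reference (January):", accuracy_score(reference, run_january))
print("Accuracy vs reference (July):   ", accuracy_score(reference, run_july))
print("Test-retest agreement (kappa):  ", round(cohen_kappa_score(run_january, run_july), 3))
```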
Affiliation(s)
- Liya Ding
- Department of Pathology, Sir Run Run Shaw Hospital, Zhejiang University School of Medicine, Hangzhou, China
| | - Lei Fan
- Department of Pathology, Sir Run Run Shaw Hospital, Zhejiang University School of Medicine, Hangzhou, China
- Department of Pathology, Ninghai County Traditional Chinese Medicine Hospital, Ningbo, China
| | - Miao Shen
- Department of Pathology, Sir Run Run Shaw Hospital, Zhejiang University School of Medicine, Hangzhou, China
- Department of Pathology, Deqing People’s Hospital, Hangzhou, China
| | - Yawen Wang
- College of Biomedical Engineering and Instrument Science, Zhejiang University, Hangzhou, China
| | - Kaiqin Sheng
- Department of Pathology, Sir Run Run Shaw Hospital, Zhejiang University School of Medicine, Hangzhou, China
| | - Zijuan Zou
- Department of Pathology, Sir Run Run Shaw Hospital, Zhejiang University School of Medicine, Hangzhou, China
| | - Huimin An
- Department of Pathology, Sir Run Run Shaw Hospital, Zhejiang University School of Medicine, Hangzhou, China
| | - Zhinong Jiang
- Department of Pathology, Sir Run Run Shaw Hospital, Zhejiang University School of Medicine, Hangzhou, China
|
28
|
Zhang K, Meng X, Yan X, Ji J, Liu J, Xu H, Zhang H, Liu D, Wang J, Wang X, Gao J, Wang YGS, Shao C, Wang W, Li J, Zheng MQ, Yang Y, Tang YD. Revolutionizing Health Care: The Transformative Impact of Large Language Models in Medicine. J Med Internet Res 2025; 27:e59069. [PMID: 39773666 PMCID: PMC11751657 DOI: 10.2196/59069] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/01/2024] [Revised: 08/26/2024] [Accepted: 09/10/2024] [Indexed: 01/11/2025] Open
Abstract
Large language models (LLMs) are rapidly advancing medical artificial intelligence, offering revolutionary changes in health care. These models excel in natural language processing (NLP), enhancing clinical support, diagnosis, treatment, and medical research. Breakthroughs, like GPT-4 and BERT (Bidirectional Encoder Representations from Transformer), demonstrate LLMs' evolution through improved computing power and data. However, their high hardware requirements are being addressed through technological advancements. LLMs are unique in processing multimodal data, thereby improving emergency, elder care, and digital medical procedures. Challenges include ensuring their empirical reliability, addressing ethical and societal implications, especially data privacy, and mitigating biases while maintaining privacy and accountability. The paper emphasizes the need for human-centric, bias-free LLMs for personalized medicine and advocates for equitable development and access. LLMs hold promise for transformative impacts in health care.
Affiliation(s)
- Kuo Zhang
- Department of Cardiology, State Key Laboratory of Cardiovascular Disease, Fuwai Hospital, National Center for Cardiovascular Diseases, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
| | | | - Xiangyu Yan
- School of Disaster and Emergency Medicine, Tianjin University, Tianjin, China
| | - Jiaming Ji
- Institute for Artificial Intelligence, Peking University, Beijing, China
| | | | - Hua Xu
- Division of Emerging Interdisciplinary Areas, Hong Kong University of Science and Technology, Hong Kong, China (Hong Kong)
| | - Heng Zhang
- Institute for Artificial Intelligence, Hefei University of Technology, Hefei, Anhui, China
| | - Da Liu
- Department of Cardiology, the First Hospital of Hebei Medical University, Graduate School of Hebei Medical University, Shijiazhuang, Hebei, China
| | - Jingjia Wang
- Department of Cardiology and Institute of Vascular Medicine, Key Laboratory of Molecular Cardiovascular Science, Ministry of Education, Peking University Third Hospital, Beijing, China
| | - Xuliang Wang
- Department of Cardiology and Institute of Vascular Medicine, Key Laboratory of Molecular Cardiovascular Science, Ministry of Education, Peking University Third Hospital, Beijing, China
| | - Jun Gao
- Department of Cardiology and Institute of Vascular Medicine, Key Laboratory of Molecular Cardiovascular Science, Ministry of Education, Peking University Third Hospital, Beijing, China
| | - Yuan-Geng-Shuo Wang
- Department of Cardiology and Institute of Vascular Medicine, Key Laboratory of Molecular Cardiovascular Science, Ministry of Education, Peking University Third Hospital, Beijing, China
| | - Chunli Shao
- Department of Cardiology and Institute of Vascular Medicine, Key Laboratory of Molecular Cardiovascular Science, Ministry of Education, Peking University Third Hospital, Beijing, China
| | - Wenyao Wang
- Department of Cardiology and Institute of Vascular Medicine, Key Laboratory of Molecular Cardiovascular Science, Ministry of Education, Peking University Third Hospital, Beijing, China
| | - Jiarong Li
- Henley Business School, University of Reading, RG6 6UD, United Kingdom
| | - Ming-Qi Zheng
- Department of Cardiology, the First Hospital of Hebei Medical University, Graduate School of Hebei Medical University, Shijiazhuang, Hebei, China
| | - Yaodong Yang
- Institute for Artificial Intelligence, Peking University, Beijing, China
| | - Yi-Da Tang
- Department of Cardiology and Institute of Vascular Medicine, Key Laboratory of Molecular Cardiovascular Science, Ministry of Education, Peking University Third Hospital, Beijing, China
|
29
|
Cheng HY. ChatGPT's Attitude, Knowledge, and Clinical Application in Geriatrics Practice and Education: Exploratory Observational Study. JMIR Form Res 2025; 9:e63494. [PMID: 39752214 PMCID: PMC11742095 DOI: 10.2196/63494] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/30/2024] [Revised: 10/26/2024] [Accepted: 11/17/2024] [Indexed: 01/04/2025] Open
Abstract
BACKGROUND The increasing use of ChatGPT in clinical practice and medical education necessitates the evaluation of its reliability, particularly in geriatrics. OBJECTIVE This study aimed to evaluate ChatGPT's trustworthiness in geriatrics through 3 distinct approaches: evaluating ChatGPT's geriatrics attitude, knowledge, and clinical application with 2 vignettes of geriatric syndromes (polypharmacy and falls). METHODS We used the validated University of California, Los Angeles, geriatrics attitude and knowledge instruments to evaluate ChatGPT's geriatrics attitude and knowledge and compare its performance with that of medical students, residents, and geriatrics fellows from reported results in the literature. We also evaluated ChatGPT's application to 2 vignettes of geriatric syndromes (polypharmacy and falls). RESULTS The mean total score on geriatrics attitude of ChatGPT was significantly lower than that of trainees (medical students, internal medicine residents, and geriatric medicine fellows; 2.7 vs 3.7 on a scale from 1-5; 1=strongly disagree; 5=strongly agree). The mean subscore on positive geriatrics attitude of ChatGPT was higher than that of the trainees (medical students, internal medicine residents, and neurologists; 4.1 vs 3.7 on a scale from 1 to 5 where a higher score means a more positive attitude toward older adults). The mean subscore on negative geriatrics attitude of ChatGPT was lower than that of the trainees and neurologists (1.8 vs 2.8 on a scale from 1 to 5 where a lower subscore means a less negative attitude toward aging). On the University of California, Los Angeles geriatrics knowledge test, ChatGPT outperformed all medical students, internal medicine residents, and geriatric medicine fellows from validated studies (14.7 vs 11.3 with a score range of -18 to +18 where +18 means that all questions were answered correctly). Regarding the polypharmacy vignette, ChatGPT not only demonstrated solid knowledge of potentially inappropriate medications but also accurately identified 7 common potentially inappropriate medications and 5 drug-drug and 3 drug-disease interactions. However, ChatGPT missed 5 drug-disease and 1 drug-drug interaction and produced 2 hallucinations. Regarding the fall vignette, ChatGPT answered 3 of 5 pretests correctly and 2 of 5 pretests partially correctly, identified 6 categories of fall risks, followed fall guidelines correctly, listed 6 key physical examinations, and recommended 6 categories of fall prevention methods. CONCLUSIONS This study suggests that ChatGPT can be a valuable supplemental tool in geriatrics, offering reliable information with less age bias, robust geriatrics knowledge, and comprehensive recommendations for managing 2 common geriatric syndromes (polypharmacy and falls) that are consistent with evidence from guidelines, systematic reviews, and other types of studies. ChatGPT's potential as an educational and clinical resource could significantly benefit trainees, health care providers, and laypeople. Further research using GPT-4o, larger geriatrics question sets, and more geriatric syndromes is needed to expand and confirm these findings before adopting ChatGPT widely for geriatrics education and practice.
Affiliation(s)
- Huai Yong Cheng
- Minneapolis VA Health Care System, Minneapolis, MN, United States
|
30
|
Kinikoglu I. Evaluating ChatGPT and Google Gemini Performance and Implications in Turkish Dental Education. Cureus 2025; 17:e77292. [PMID: 39801704 PMCID: PMC11724709 DOI: 10.7759/cureus.77292] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 01/11/2025] [Indexed: 01/16/2025] Open
Abstract
Artificial intelligence (AI) has emerged as a transformative tool in education, particularly in specialized fields such as dentistry. This study evaluated the performance of four advanced AI models, ChatGPT-4o (San Francisco, CA: OpenAI), ChatGPT-o1, Gemini 1.5 Pro (Mountain View, CA: Google LLC), and Gemini 2.0 Advanced, in the Turkish Dental Specialty Examination (DUS) for 2020 and 2021. A total of 240 questions, comprising 120 questions per year from basic and clinical sciences, were analyzed. AI models were assessed based on their accuracy in providing correct answers compared to the official answer keys. For the 2020 DUS, ChatGPT-o1 and Gemini 2.0 Advanced achieved the highest accuracy rates of 93.70% and 96.80%, respectively, with net scores of 112.50 and 115 out of 120 questions. ChatGPT-4o and Gemini 1.5 Pro followed with accuracy rates of 83.33% and 85.40%. For the 2021 DUS, ChatGPT-o1 again demonstrated the highest accuracy at 97.88% (115.50 net score), closely followed by Gemini 2.0 Advanced at 96.82% (114.25 net score). Overall, ChatGPT-4o and Gemini 1.5 Pro scored lower for 2021, achieving accuracy rates of 88.35% and 93.64%, respectively. Combining results from both years (238 total questions), ChatGPT-o1 and Gemini 2.0 Advanced achieved accuracy rates of 97.46% (230 correct answers, 95% CI: 94.62%, 100.00%) and 97.90% (231 correct answers, 95% CI: 94.62%, 100.00%), respectively, significantly outperforming ChatGPT-4o (88.66%, 211 correct answers, 95% CI: 85.43%, 91.89%) and Gemini 1.5 Pro (91.60%, 218 correct answers, 95% CI: 87.75%, 95.45%). Statistical analysis revealed significant differences among the models (p = 0.0002). Pairwise comparisons demonstrated that ChatGPT-4o underperformed significantly compared to ChatGPT-o1 (p = 0.0016) and Gemini 2.0 Advanced (p = 0.0007) after Bonferroni correction. The consistently high accuracy rates and narrow confidence intervals for the top-performing models underscore their superior reliability and performance in answering the DUS questions. Generative AI models such as ChatGPT-o1 and Gemini 2.0 have the potential to enhance dental board exam preparation through question evaluation. While the AI models appear to outperform humans on DUS questions, the study raises a concern about the ethical uses of AI and the true justification and value of DUS examinations as dental competency examinations. A higher level of knowledge evaluation should be considered. This research contributes to the growing body of literature on AI applications in specialized knowledge domains and provides a foundation for further exploration of its integration into dental education.
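The confidence intervals and Bonferroni-corrected pairwise comparisons reported above follow standard proportion statistics; the sketch below shows one way to compute them, with counts loosely modeled on the reported figures but intended only as an illustration.

```python
# A sketch of the underlying statistics (counts loosely modeled on the reported
# figures, for illustration only): Wald 95% CI for an accuracy proportion and a
# two-proportion z-test with Bonferroni correction.
from math import sqrt
from scipy.stats import norm

def wald_ci(correct, total, alpha=0.05):
    p = correct / total
    half = norm.ppf(1 - alpha / 2) * sqrt(p * (1 - p) / total)
    return p, max(0.0, p - half), min(1.0, p + half)

def two_prop_p(c1, n1, c2, n2):
    p1, p2 = c1 / n1, c2 / n2
    pooled = (c1 + c2) / (n1 + n2)
    z = (p1 - p2) / sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return 2 * (1 - norm.cdf(abs(z)))  # two-sided p-value

models = {"model_a": (230, 238), "model_b": (211, 238)}  # (correct, total)
for name, (c, n) in models.items():
    p, lo, hi = wald_ci(c, n)
    print(f"{name}: {p:.4f} (95% CI {lo:.4f}-{hi:.4f})")

n_comparisons = 6  # e.g., all pairs among four models
p_raw = two_prop_p(*models["model_a"], *models["model_b"])
print("Bonferroni-adjusted p:", min(1.0, p_raw * n_comparisons))
```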
Affiliation(s)
- Ipek Kinikoglu
- Pedodontics, Istanbul Turkuaz Dental Clinic, Istanbul, TUR
|
31
|
Keat K, Venkatesh R, Huang Y, Kumar R, Tuteja S, Sangkuhl K, Li B, Gong L, Whirl-Carrillo M, Klein TE, Ritchie MD, Kim D. PGxQA: A Resource for Evaluating LLM Performance for Pharmacogenomic QA Tasks. Pac Symp Biocomput 2025; 30:229-246. [PMID: 39670373 PMCID: PMC11734741 DOI: 10.1142/9789819807024_0017] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 05/16/2025]
Abstract
Pharmacogenetics represents one of the most promising areas of precision medicine, with several guidelines for genetics-guided treatment ready for clinical use. Despite this, implementation has been slow, with few health systems incorporating the technology into their standard of care. One major barrier to uptake is the lack of education and awareness of pharmacogenetics among clinicians and patients. The introduction of large language models (LLMs) like GPT-4 has raised the possibility of medical chatbots that deliver timely information to clinicians, patients, and researchers with a simple interface. Although state-of-the-art LLMs have shown impressive performance at advanced tasks like medical licensing exams, in practice they still often provide false information, which is particularly hazardous in a clinical context. To quantify the extent of this issue, we developed a series of automated and expert-scored tests to evaluate the performance of chatbots in answering pharmacogenetics questions from the perspective of clinicians, patients, and researchers. We applied this benchmark to state-of-the-art LLMs and found that newer models like GPT-4o greatly outperform their predecessors, but still fall short of the standards required for clinical use. Our benchmark will be a valuable public resource for subsequent developments in this space as we work towards better clinical AI for pharmacogenetics.
Affiliation(s)
- Karl Keat
- Genomics and Computational Biology Graduate Program, University of Pennsylvania, Philadelphia, PA, USA
| | - Rasika Venkatesh
- Genomics and Computational Biology Graduate Program, University of Pennsylvania, Philadelphia, PA, USA
| | - Yidi Huang
- Genomics and Computational Biology Graduate Program, University of Pennsylvania, Philadelphia, PA, USA
| | - Rachit Kumar
- Genomics and Computational Biology Graduate Program, University of Pennsylvania, Philadelphia, PA, USA
| | - Sony Tuteja
- Department of Medicine, University of Pennsylvania, Philadelphia, PA, USA
| | - Katrin Sangkuhl
- Department of Biomedical Data Science, Stanford University, Stanford, CA, USA
| | - Binglan Li
- Department of Biomedical Data Science, Stanford University, Stanford, CA, USA
| | - Li Gong
- Department of Biomedical Data Science, Stanford University, Stanford, CA, USA
| | | | - Teri E Klein
- Department of Biomedical Data Science, Stanford University, Stanford, CA, USA; Department of Medicine (BMIR), Stanford University, Stanford, CA, USA
| | - Marylyn D Ritchie
- Department of Genetics, University of Pennsylvania, Philadelphia, PA, USA,
| | - Dokyoon Kim
- Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA, USA,
|
32
|
Qin H, Tong Y. Opportunities and Challenges for Large Language Models in Primary Health Care. J Prim Care Community Health 2025; 16:21501319241312571. [PMID: 40162893 PMCID: PMC11960148 DOI: 10.1177/21501319241312571] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/15/2024] [Revised: 12/14/2024] [Accepted: 12/17/2024] [Indexed: 04/02/2025] Open
Abstract
Primary Health Care (PHC) is the cornerstone of the global health care system and the primary objective for achieving universal health coverage. China's PHC system faces several challenges, including uneven distribution of medical resources, a lack of qualified primary healthcare personnel, ineffective implementation of the hierarchical medical treatment system, and a serious situation regarding the prevention and control of chronic diseases. With the rapid advancement of artificial intelligence (AI) technology, large language models (LLMs) demonstrate significant potential in the medical field with their powerful natural language processing and reasoning capabilities, especially in PHC. This review focuses on the various potential applications of LLMs in China's PHC, including health promotion and disease prevention, medical consultation and health management, diagnosis and triage, chronic disease management, and mental health support. Additionally, pragmatic obstacles are analyzed, such as transparency, outcomes misrepresentation, privacy concerns, and social biases. Future development should emphasize interdisciplinary collaboration and resource sharing, ongoing improvements in health equity, and innovative advancements in medical large models. There is a demand to establish a safe, effective, equitable, and flexible ethical and legal framework, along with a robust accountability mechanism, to support the achievement of universal health coverage.
Affiliation(s)
- Hongyang Qin
- The Second Affiliated Hospital of Zhejiang University School of Medicine, Hangzhou, China
- Beigan Street Community Health Service Center, Xiaoshan District, Hangzhou, China
| | - Yuling Tong
- The Second Affiliated Hospital of Zhejiang University School of Medicine, Hangzhou, China
|
33
|
Mishra T, Sutanto E, Rossanti R, Pant N, Ashraf A, Raut A, Uwabareze G, Oluwatomiwa A, Zeeshan B. Use of large language models as artificial intelligence tools in academic research and publishing among global clinical researchers. Sci Rep 2024; 14:31672. [PMID: 39738210 PMCID: PMC11685435 DOI: 10.1038/s41598-024-81370-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/25/2024] [Accepted: 11/26/2024] [Indexed: 01/01/2025] Open
Abstract
With breakthroughs in Natural Language Processing and Artificial Intelligence (AI), the usage of Large Language Models (LLMs) in academic research has increased tremendously. Models such as Generative Pre-trained Transformer (GPT) are used by researchers in literature review, abstract screening, and manuscript drafting. However, these models also present the attendant challenge of providing ethically questionable scientific information. Our study provides a snapshot of global researchers' perception of current trends and future impacts of LLMs in research. Using a cross-sectional design, we surveyed 226 medical and paramedical researchers from 59 countries across 65 specialties, trained in the Global Clinical Scholars' Research Training certificate program of Harvard Medical School between 2020 and 2024. The majority (57.5%) of these participants practiced in an academic setting with a median of 7 (2, 18) PubMed-indexed published articles. A total of 198 respondents (87.6%) were aware of LLMs, and those who were aware had a higher number of publications (p < 0.001). 18.7% of the respondents who were aware (n = 37) had previously used LLMs in publications, especially for grammatical errors and formatting (64.9%); however, most (40.5%) did not acknowledge this use in their papers. 50.8% of aware respondents (n = 95) predicted an overall positive future impact of LLMs, while 32.6% were unsure of its scope. 52% of aware respondents (n = 102) believed that LLMs would have a major impact in areas such as grammatical errors and formatting (66.3%), revision and editing (57.2%), writing (57.2%), and literature review (54.2%). 58.1% of aware respondents opined that journals should allow the use of AI in research, and 78.3% believed that regulations should be put in place to avoid its abuse. Given researchers' perceptions of LLMs and the significant association between LLM awareness and the number of published works, we emphasize the importance of developing comprehensive guidelines and an ethical framework to govern the use of AI in academic research and address the current challenges.
Affiliation(s)
- Tanisha Mishra
- Kasturba Medical College, Manipal, Manipal Academy of Higher Education, Manipal, Karnataka, 576104, India
| | - Edward Sutanto
- Nuffield Department of Medicine, Centre for Tropical Medicine and Global Health, University of Oxford, Oxford, OX3 7LG, UK
- Faculty of Medicine, Oxford University Clinical Research Unit Indonesia, Universitas Indonesia, Jakarta, 10430, Indonesia
| | - Rini Rossanti
- Department of Child Health, Dr. Hasan Sadikin General Hospital/Faculty of Medicine, Universitas Padjadjaran, Bandung, Indonesia
| | - Nayana Pant
- Royal Free NHS Foundation Trust Hospital, Pond Street, London, NW32QG, UK
| | - Anum Ashraf
- Department of Pharmacology & Therapeutics, Allama Iqbal Medical College, Jinnah Hospital, Lahore, Pakistan
| | - Akshay Raut
- Department of Internal Medicine, Guthrie Robert Packer Hospital, Sayre, PA, 18840, USA
| | | | | | - Bushra Zeeshan
- Department of Dermatology, Niazi Hospital, Lahore, Pakistan.
- Allama Iqbal Medical College, Jinnah Hospital, Lahore, Pakistan.
|
34
|
Zong H, Wu R, Cha J, Wang J, Wu E, Li J, Zhou Y, Zhang C, Feng W, Shen B. Large Language Models in Worldwide Medical Exams: Platform Development and Comprehensive Analysis. J Med Internet Res 2024; 26:e66114. [PMID: 39729356 PMCID: PMC11724220 DOI: 10.2196/66114] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/04/2024] [Revised: 11/06/2024] [Accepted: 12/10/2024] [Indexed: 12/28/2024] Open
Abstract
BACKGROUND Large language models (LLMs) are increasingly integrated into medical education, with transformative potential for learning and assessment. However, their performance across diverse medical exams globally has remained underexplored. OBJECTIVE This study aims to introduce MedExamLLM, a comprehensive platform designed to systematically evaluate the performance of LLMs on medical exams worldwide. Specifically, the platform seeks to (1) compile and curate performance data for diverse LLMs on worldwide medical exams; (2) analyze trends and disparities in LLM capabilities across geographic regions, languages, and contexts; and (3) provide a resource for researchers, educators, and developers to explore and advance the integration of artificial intelligence in medical education. METHODS A systematic search was conducted on April 25, 2024, in the PubMed database to identify relevant publications. Inclusion criteria encompassed peer-reviewed, English-language, original research articles that evaluated at least one LLM on medical exams. Exclusion criteria included review articles, non-English publications, preprints, and studies without relevant data on LLM performance. The screening process for candidate publications was independently conducted by 2 researchers to ensure accuracy and reliability. Data, including exam information, data process information, model performance, data availability, and references, were manually curated, standardized, and organized. These curated data were integrated into the MedExamLLM platform, enabling its functionality to visualize and analyze LLM performance across geographic, linguistic, and exam characteristics. The web platform was developed with a focus on accessibility, interactivity, and scalability to support continuous data updates and user engagement. RESULTS A total of 193 articles were included for final analysis. MedExamLLM comprised information for 16 LLMs on 198 medical exams conducted in 28 countries across 15 languages from the year 2009 to the year 2023. The United States accounted for the highest number of medical exams and related publications, with English being the dominant language used in these exams. The Generative Pretrained Transformer (GPT) series models, especially GPT-4, demonstrated superior performance, achieving pass rates significantly higher than other LLMs. The analysis revealed significant variability in the capabilities of LLMs across different geographic and linguistic contexts. CONCLUSIONS MedExamLLM is an open-source, freely accessible, and publicly available online platform providing comprehensive performance evaluation information and evidence knowledge about LLMs on medical exams around the world. The MedExamLLM platform serves as a valuable resource for educators, researchers, and developers in the fields of clinical medicine and artificial intelligence. By synthesizing evidence on LLM capabilities, the platform provides valuable insights to support the integration of artificial intelligence into medical education. Limitations include potential biases in the data source and the exclusion of non-English literature. Future research should address these gaps and explore methods to enhance LLM performance in diverse contexts.
Affiliation(s)
- Hui Zong
- Joint Laboratory of Artificial Intelligence for Critical Care Medicine, Department of Critical Care Medicine and Institutes for Systems Genetics, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, Chengdu, China
- Rongrong Wu
- Joint Laboratory of Artificial Intelligence for Critical Care Medicine, Department of Critical Care Medicine and Institutes for Systems Genetics, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, Chengdu, China
- Jiaxue Cha
- Shanghai Key Laboratory of Signaling and Disease Research, School of Life Sciences and Technology, Tongji University, Shanghai, China
- Jiao Wang
- Joint Laboratory of Artificial Intelligence for Critical Care Medicine, Department of Critical Care Medicine and Institutes for Systems Genetics, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, Chengdu, China
- Erman Wu
- Joint Laboratory of Artificial Intelligence for Critical Care Medicine, Department of Critical Care Medicine and Institutes for Systems Genetics, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, Chengdu, China
- Department of Neurosurgery, First Affiliated Hospital of Xinjiang Medical University, Urumqi, China
- Jiakun Li
- Joint Laboratory of Artificial Intelligence for Critical Care Medicine, Department of Critical Care Medicine and Institutes for Systems Genetics, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, Chengdu, China
- Department of Urology, West China Hospital, Sichuan University, Chengdu, China
- Yi Zhou
- Joint Laboratory of Artificial Intelligence for Critical Care Medicine, Department of Critical Care Medicine and Institutes for Systems Genetics, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, Chengdu, China
- Chi Zhang
- Joint Laboratory of Artificial Intelligence for Critical Care Medicine, Department of Critical Care Medicine and Institutes for Systems Genetics, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, Chengdu, China
- Weizhe Feng
- Joint Laboratory of Artificial Intelligence for Critical Care Medicine, Department of Critical Care Medicine and Institutes for Systems Genetics, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, Chengdu, China
- Bairong Shen
- Joint Laboratory of Artificial Intelligence for Critical Care Medicine, Department of Critical Care Medicine and Institutes for Systems Genetics, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, Chengdu, China
- West China Tianfu Hospital, Sichuan University, Chengdu, China
35
Zhuang S, Zeng Y, Lin S, Chen X, Xin Y, Li H, Lin Y, Zhang C, Lin Y. Evaluation of the ability of large language models to self-diagnose oral diseases. iScience 2024; 27:111495. [PMID: 39758998 PMCID: PMC11699252 DOI: 10.1016/j.isci.2024.111495] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/07/2024] [Revised: 09/17/2024] [Accepted: 11/26/2024] [Indexed: 01/03/2025] Open
Abstract
Large language models (LLMs) offer potential in primary dental care. We evaluated LLMs' diagnostic capabilities across various oral diseases and contexts. All LLMs showed diagnostic capability for temporomandibular joint disorders, periodontal disease, dental caries, and malocclusion. The prompts did not affect the performance of ChatGPT 3.5. When Chinese was used, the diagnostic ability of ChatGPT 3.5 for pulpitis improved (0% vs. 61.7%, p < 0.001), while its ability to diagnose pericoronitis decreased (8% vs. 0%, p < 0.001). For ChatGPT 4.0, Chinese prompts improved both (0% vs. 92% and 8% vs. 72%, respectively; p < 0.001). Claude 2 exhibited the highest accuracy in diagnosing pulpitis (36%, p = 0.048), while ChatGPT 4.0 showed complete diagnostic capability for pericoronitis. Llama 2 and Claude 3.5 Sonnet exhibited complete diagnostic capability for oral cancer. In conclusion, LLMs may be a potential tool for daily dental care but need further updates.
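The language effects reported above are comparisons of diagnostic accuracy between prompt languages, which can be tested as two proportions. A minimal sketch follows, using Fisher's exact test on illustrative counts (not the study's raw data) chosen only to approximate the reported 0% versus 61.7%.

```python
# Illustrative sketch: comparing diagnostic accuracy under English vs. Chinese
# prompts as two proportions with Fisher's exact test. Counts are assumed.
from scipy.stats import fisher_exact

n_prompts = 60                           # assumed repetitions per language
correct_english, correct_chinese = 0, 37  # roughly 0% vs. 61.7%

table = [
    [correct_english, n_prompts - correct_english],
    [correct_chinese, n_prompts - correct_chinese],
]
odds_ratio, p_value = fisher_exact(table)
print(f"Fisher's exact test: p = {p_value:.2e}")
```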
Affiliation(s)
- Shiyang Zhuang
- Department of Stomatology, the First Affiliated Hospital, Fujian Medical University, Fuzhou 350005, China
- Department of Stomatology, National Regional Medical Center, Binhai Campus of the First Affiliated Hospital, Fujian Medical University, Fuzhou 350212, China
- School of Stomatology, Fujian Medical University, Fuzhou 350212, China
- Yuanhao Zeng
- School of Computer Science (National Pilot Software Engineering School), Beijing University of Posts and Telecommunications, Beijing 100876, China
- Shaojunjie Lin
- School of Stomatology, Fujian Medical University, Fuzhou 350212, China
- Xirui Chen
- School of Stomatology, Fujian Medical University, Fuzhou 350212, China
- Yishan Xin
- Department of Orthopaedic Surgery, the First Affiliated Hospital, Fujian Medical University, Fuzhou 350005, China
- Department of Orthopaedic Surgery, National Regional Medical Center, Binhai Campus of the First Affiliated Hospital, Fujian Medical University, Fuzhou 350212, China
- Fujian Provincial Institute of Orthopedics, the First Affiliated Hospital, Fujian Medical University, Fuzhou 350005, China
- Fujian Orthopedic Bone and Joint Disease and Sports Rehabilitation Clinical Medical Research Center, Fuzhou 350212, China
- Hongyan Li
- Department of Orthopaedic Surgery, the First Affiliated Hospital, Fujian Medical University, Fuzhou 350005, China
- Department of Orthopaedic Surgery, National Regional Medical Center, Binhai Campus of the First Affiliated Hospital, Fujian Medical University, Fuzhou 350212, China
- Fujian Provincial Institute of Orthopedics, the First Affiliated Hospital, Fujian Medical University, Fuzhou 350005, China
- Fujian Orthopedic Bone and Joint Disease and Sports Rehabilitation Clinical Medical Research Center, Fuzhou 350212, China
- Yiming Lin
- Department of Orthopaedic Surgery, the First Affiliated Hospital, Fujian Medical University, Fuzhou 350005, China
- Department of Orthopaedic Surgery, National Regional Medical Center, Binhai Campus of the First Affiliated Hospital, Fujian Medical University, Fuzhou 350212, China
- Fujian Provincial Institute of Orthopedics, the First Affiliated Hospital, Fujian Medical University, Fuzhou 350005, China
- Fujian Orthopedic Bone and Joint Disease and Sports Rehabilitation Clinical Medical Research Center, Fuzhou 350212, China
- Chaofan Zhang
- Department of Orthopaedic Surgery, the First Affiliated Hospital, Fujian Medical University, Fuzhou 350005, China
- Department of Orthopaedic Surgery, National Regional Medical Center, Binhai Campus of the First Affiliated Hospital, Fujian Medical University, Fuzhou 350212, China
- Fujian Provincial Institute of Orthopedics, the First Affiliated Hospital, Fujian Medical University, Fuzhou 350005, China
- Fujian Orthopedic Bone and Joint Disease and Sports Rehabilitation Clinical Medical Research Center, Fuzhou 350212, China
- Yunzhi Lin
- Department of Stomatology, the First Affiliated Hospital, Fujian Medical University, Fuzhou 350005, China
- Department of Stomatology, National Regional Medical Center, Binhai Campus of the First Affiliated Hospital, Fujian Medical University, Fuzhou 350212, China
36
Singal A, Goyal S. Reliability and efficiency of ChatGPT 3.5 and 4.0 as a tool for scalenovertebral triangle anatomy education. Surg Radiol Anat 2024; 47:24. [PMID: 39652180 DOI: 10.1007/s00276-024-03513-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/29/2024] [Accepted: 11/18/2024] [Indexed: 01/11/2025]
Abstract
PURPOSE As the popularity and use of artificial intelligence (AI) tools increase in medical education, it is important to critically evaluate these resources and confirm their reliability. The current study assessed the reliability and effectiveness of ChatGPT 3.5 and 4 for gross anatomical information on the scalenovertebral triangle. METHODS ChatGPT versions 3.5 and 4 were queried about the anatomy of the scalenovertebral triangle eight times on different days. The responses were qualitatively compared with the actual anatomy of the region, and the authors commented on each response. RESULTS The replies given by ChatGPT were not appropriate (either incorrect, partially correct, or incomplete) in any of the conversations. There was no major difference in accuracy between ChatGPT 3.5 and 4. Almost three out of four times, ChatGPT confused the scalenovertebral triangle with the scalene or interscalene triangle. CONCLUSIONS None of the responses provided by ChatGPT 3.5 and 4 across all eight instances aligned with the standard anatomical description of the scalenovertebral triangle. A novice medical student may not be able to distinguish correct from incorrect information and may consequently misinterpret the anatomy, so careful planning and educator oversight are important when these tools are used. Further development and modification of this AI tool are required to increase its potential for use in medical education and healthcare.
Affiliation(s)
- Anjali Singal
- Department of Anatomy, All India Institute of Medical Sciences, Bathinda, Punjab, 151001, India.
- Swati Goyal
- Department of Radiodiagnosis, Gandhi Medical College, Bhopal, Madhya Pradesh, India
37
Geneş M, Deveci B. A Clinical Evaluation of Cardiovascular Emergencies: A Comparison of Responses from ChatGPT, Emergency Physicians, and Cardiologists. Diagnostics (Basel) 2024; 14:2731. [PMID: 39682639 DOI: 10.3390/diagnostics14232731] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/28/2024] [Revised: 11/23/2024] [Accepted: 12/02/2024] [Indexed: 12/18/2024] Open
Abstract
Background: Artificial intelligence (AI) tools, like ChatGPT, are gaining attention for their potential in supporting clinical decisions. This study evaluates the performance of ChatGPT-4o in acute cardiological cases compared to cardiologists and emergency physicians. Methods: Twenty acute cardiological scenarios were used to compare the responses of ChatGPT-4o, cardiologists, and emergency physicians in terms of accuracy, completeness, and response time. Statistical analyses included the Kruskal-Wallis H test and post hoc comparisons using the Mann-Whitney U test with Bonferroni correction. Results: ChatGPT-4o and cardiologists both achieved 100% correct response rates, while emergency physicians showed lower accuracy. ChatGPT-4o provided the fastest responses and obtained the highest accuracy and completeness scores. Statistically significant differences were found between ChatGPT-4o and emergency physicians (p < 0.001), and between cardiologists and emergency physicians (p < 0.001). A Cohen's kappa value of 0.92 indicated a high level of inter-rater agreement. Conclusions: ChatGPT-4o outperformed human clinicians in accuracy, completeness, and response time, highlighting its potential as a clinical decision support tool. However, human oversight remains essential to ensure safe AI integration in healthcare settings.
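The analysis pipeline described, an omnibus Kruskal-Wallis test followed by pairwise Mann-Whitney U tests with Bonferroni correction, is straightforward to reproduce. The sketch below uses invented completeness scores, not the study's data.

```python
# Illustrative sketch of the reported workflow: Kruskal-Wallis across three
# responder groups, then pairwise Mann-Whitney U tests with a Bonferroni
# correction. Scores are invented for demonstration only.
from itertools import combinations
from scipy.stats import kruskal, mannwhitneyu

scores = {
    "ChatGPT-4o":           [5, 5, 5, 4, 5, 5, 4, 5],
    "Cardiologists":        [5, 4, 5, 5, 4, 5, 5, 4],
    "Emergency physicians": [3, 4, 3, 4, 3, 3, 4, 3],
}

h_stat, p_global = kruskal(*scores.values())
print(f"Kruskal-Wallis: H = {h_stat:.2f}, p = {p_global:.4f}")

pairs = list(combinations(scores, 2))
alpha = 0.05 / len(pairs)  # Bonferroni-corrected threshold
for a, b in pairs:
    _, p = mannwhitneyu(scores[a], scores[b], alternative="two-sided")
    print(f"{a} vs {b}: p = {p:.4f} (threshold {alpha:.3f})")
```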
Affiliation(s)
- Muhammet Geneş
- Cardiology Residency, Department of Cardiology, Sincan Training and Research Hospital, Ankara 06930, Turkey
- Bülent Deveci
- Cardiology Residency, Department of Cardiology, Sincan Training and Research Hospital, Ankara 06930, Turkey
38
Nosta J. The cognitive age in medicine: Artificial intelligence, large language models, and iterative intelligence. Am J Hematol 2024; 99:2256-2257. [PMID: 39282959 DOI: 10.1002/ajh.27480] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/03/2024] [Accepted: 09/03/2024] [Indexed: 11/13/2024]
39
Xu X, Yang Y, Tan X, Zhang Z, Wang B, Yang X, Weng C, Yu R, Zhao Q, Quan S. Hepatic encephalopathy post-TIPS: Current status and prospects in predictive assessment. Comput Struct Biotechnol J 2024; 24:493-506. [PMID: 39076168 PMCID: PMC11284497 DOI: 10.1016/j.csbj.2024.07.008] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/14/2024] [Revised: 07/05/2024] [Accepted: 07/05/2024] [Indexed: 07/31/2024] Open
Abstract
Transjugular intrahepatic portosystemic shunt (TIPS) is an essential procedure for the treatment of portal hypertension but can result in hepatic encephalopathy (HE), a serious complication that worsens patient outcomes. Investigating predictors of HE after TIPS is essential to improve prognosis. This review analyzes risk factors and compares predictive models, weighing traditional scores such as Child-Pugh, Model for End-Stage Liver Disease (MELD), and albumin-bilirubin (ALBI) against emerging artificial intelligence (AI) techniques. While traditional scores provide initial insights into HE risk, they have limitations in dealing with clinical complexity. Advances in machine learning (ML), particularly when integrated with imaging and clinical data, offer refined assessments. These innovations suggest the potential for AI to significantly improve the prediction of post-TIPS HE. The study provides clinicians with a comprehensive overview of current prediction methods, while advocating for the integration of AI to increase the accuracy of post-TIPS HE assessments. By harnessing the power of AI, clinicians can better manage the risks associated with TIPS and tailor interventions to individual patient needs. Future research should therefore prioritize the development of advanced AI frameworks that can assimilate diverse data streams to support clinical decision-making. The goal is not only to more accurately predict HE, but also to improve overall patient care and quality of life.
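As a concrete point of reference for the traditional scores named above, a sketch of the classic MELD calculation is shown below; clamping conventions differ slightly between implementations, so treat the exact bounds as assumptions rather than a clinical calculator.

```python
# Sketch of the classic MELD score, one of the traditional baselines mentioned
# above. Clamping rules vary between implementations; illustrative only.
import math

def meld_score(bilirubin_mg_dl: float, inr: float, creatinine_mg_dl: float) -> float:
    """MELD = 3.78*ln(bilirubin) + 11.2*ln(INR) + 9.57*ln(creatinine) + 6.43."""
    bili = max(bilirubin_mg_dl, 1.0)            # lab values below 1 are set to 1
    inr = max(inr, 1.0)
    cr = min(max(creatinine_mg_dl, 1.0), 4.0)   # creatinine commonly capped at 4 mg/dL
    return 3.78 * math.log(bili) + 11.2 * math.log(inr) + 9.57 * math.log(cr) + 6.43

print(round(meld_score(bilirubin_mg_dl=2.1, inr=1.4, creatinine_mg_dl=1.2), 1))
```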
Affiliation(s)
- Xiaowei Xu
- Department of Gastroenterology Nursing Unit, Ward 192, The First Affiliated Hospital of Wenzhou Medical University, Wenzhou 325000, China
- Yun Yang
- School of Nursing, Wenzhou Medical University, Wenzhou 325001, China
- Xinru Tan
- The First School of Medicine, School of Information and Engineering, Wenzhou Medical University, Wenzhou 325001, China
- Ziyang Zhang
- School of Clinical Medicine, Guizhou Medical University, Guiyang 550025, China
- Boxiang Wang
- The First School of Medicine, School of Information and Engineering, Wenzhou Medical University, Wenzhou 325001, China
- Xiaojie Yang
- Wenzhou Medical University Renji College, Wenzhou 325000, China
- Chujun Weng
- The Fourth Affiliated Hospital Zhejiang University School of Medicine, Yiwu 322000, China
- Rongwen Yu
- Wenzhou Institute, University of Chinese Academy of Sciences, Wenzhou 325000, China
- Qi Zhao
- School of Computer Science and Software Engineering, University of Science and Technology Liaoning, Anshan 114051, China
- Shichao Quan
- Department of Big Data in Health Science, The First Affiliated Hospital of Wenzhou Medical University, Wenzhou 325000, China
40
Umer F, Batool I, Naved N. Innovation and application of Large Language Models (LLMs) in dentistry - a scoping review. BDJ Open 2024; 10:90. [PMID: 39617779 PMCID: PMC11609263 DOI: 10.1038/s41405-024-00277-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/15/2024] [Revised: 11/03/2024] [Accepted: 11/04/2024] [Indexed: 01/31/2025] Open
Abstract
OBJECTIVE Large Language Models (LLMs) have revolutionized healthcare, yet their integration in dentistry remains underexplored. Therefore, this scoping review aims to systematically evaluate current literature on LLMs in dentistry. DATA SOURCES The search covered PubMed, Scopus, IEEE Xplore, and Google Scholar, with studies selected based on predefined criteria. Data were extracted to identify applications, evaluation metrics, prompting strategies, and deployment levels of LLMs in dental practice. RESULTS From 4079 records, 17 studies met the inclusion criteria. ChatGPT was the predominant model, mainly used for post-operative patient queries. Likert scale was the most reported evaluation metric, and only two studies employed advanced prompting strategies. Most studies were at level 3 of deployment, indicating practical application but requiring refinement. CONCLUSION LLMs showed extensive applicability in dental specialties; however, reliance on ChatGPT necessitates diversified assessments across multiple LLMs. Standardizing reporting practices and employing advanced prompting techniques are crucial for transparency and reproducibility, necessitating continuous efforts to optimize LLM utility and address existing challenges.
Affiliation(s)
- Fahad Umer
- Associate Professor, Operative Dentistry & Endodontics, Aga Khan University Hospital, Karachi, Pakistan
- Itrat Batool
- Resident, Operative Dentistry & Endodontics, Aga Khan University Hospital, Karachi, Pakistan
- Nighat Naved
- Resident, Operative Dentistry & Endodontics, Aga Khan University Hospital, Karachi, Pakistan.
41
Omar M, Nadkarni GN, Klang E, Glicksberg BS. Large language models in medicine: A review of current clinical trials across healthcare applications. PLOS DIGITAL HEALTH 2024; 3:e0000662. [PMID: 39561120 PMCID: PMC11575759 DOI: 10.1371/journal.pdig.0000662] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/21/2024]
Abstract
This review analyzes current clinical trials investigating large language models' (LLMs) applications in healthcare. We identified 27 trials (5 published and 22 ongoing) across 4 main clinical applications: patient care, data handling, decision support, and research assistance. Our analysis reveals diverse LLM uses, from clinical documentation to medical decision-making. Published trials show promise but highlight accuracy concerns. Ongoing studies explore novel applications like patient education and informed consent. Most trials occur in the United States of America and China. We discuss the challenges of evaluating rapidly evolving LLMs through clinical trials and identify gaps in current research. This review aims to inform future studies and guide the integration of LLMs into clinical practice.
Affiliation(s)
- Mahmud Omar
- Maccabi Health Services, Israel
- The Division of Data-Driven and Digital Medicine (D3M), Icahn School of Medicine at Mount Sinai, New York, United States of America
- Girish N Nadkarni
- The Division of Data-Driven and Digital Medicine (D3M), Icahn School of Medicine at Mount Sinai, New York, United States of America
- The Charles Bronfman Institute for Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, United States of America
- Eyal Klang
- The Division of Data-Driven and Digital Medicine (D3M), Icahn School of Medicine at Mount Sinai, New York, United States of America
- The Charles Bronfman Institute for Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, United States of America
- Benjamin S Glicksberg
- The Division of Data-Driven and Digital Medicine (D3M), Icahn School of Medicine at Mount Sinai, New York, United States of America
- The Charles Bronfman Institute for Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, United States of America
42
Lee C, Britto S, Diwan K. Evaluating the Impact of Artificial Intelligence (AI) on Clinical Documentation Efficiency and Accuracy Across Clinical Settings: A Scoping Review. Cureus 2024; 16:e73994. [PMID: 39703286 PMCID: PMC11658896 DOI: 10.7759/cureus.73994] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 11/18/2024] [Indexed: 12/21/2024] Open
Abstract
Artificial intelligence (AI) technologies (natural language processing (NLP), speech recognition (SR), and machine learning (ML)) can transform clinical documentation in healthcare. This scoping review evaluates the impact of AI on the accuracy and efficiency of clinical documentation across various clinical settings (hospital wards, emergency departments, and outpatient clinics). We found 176 articles by applying a specific search string on Ovid. To ensure a more comprehensive search process, we also performed manual searches on PubMed and BMJ, examining any relevant references we encountered. In this way, we were able to add 46 more articles, resulting in 222 articles in total. After removing duplicates, 208 articles were screened. This led to the inclusion of 36 studies. We were mostly interested in articles discussing the impact of AI technologies, such as NLP, ML, and SR, and their accuracy and efficiency in clinical documentation. To ensure that our research reflected recent work, we focused our efforts on studies published in 2019 and beyond. This criterion was pilot-tested beforehand and necessary adjustments were made. After comparing screened articles independently, we ensured inter-rater reliability (Cohen's kappa=1.0), and data extraction was completed on these 36 articles. We conducted this study according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines. This scoping review shows improvements in clinical documentation using AI technologies, with an emphasis on accuracy and efficiency. There was a reduction in clinician workload, with the streamlining of the documentation processes. Subsequently, doctors also had more time for patient care. However, these articles also raised various challenges surrounding the use of AI in clinical settings. These challenges included the management of errors, legal liability, and integration of AI with electronic health records (EHRs). There were also some ethical concerns regarding the use of AI with patient data. AI shows massive potential for improving the day-to-day work life of doctors across various clinical settings. However, more research is needed to address the many challenges associated with its use. Studies demonstrate improved accuracy and efficiency in clinical documentation with the use of AI. With better regulatory frameworks, implementation, and research, AI can significantly reduce the burden placed on doctors by documentation.
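The screening agreement reported here (Cohen's kappa = 1.0) is the standard way to quantify inter-rater reliability; a minimal sketch with illustrative include/exclude decisions is shown below.

```python
# Minimal sketch: Cohen's kappa between two reviewers' screening decisions.
# Decisions are illustrative; identical decision lists yield kappa = 1.0.
from sklearn.metrics import cohen_kappa_score

reviewer_a = ["include", "exclude", "include", "include", "exclude", "exclude"]
reviewer_b = ["include", "exclude", "include", "include", "exclude", "exclude"]

print(f"Cohen's kappa = {cohen_kappa_score(reviewer_a, reviewer_b):.2f}")
```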
Affiliation(s)
- Craig Lee
- General Internal Medicine, University Hospitals Plymouth NHS Trust, Plymouth, GBR
- Shawn Britto
- General Internal Medicine, University Hospitals Plymouth NHS Trust, Plymouth, GBR
- Khaled Diwan
- General Internal Medicine, University Hospitals Plymouth NHS Trust, Plymouth, GBR
43
Gill GS, Blair J, Litinsky S. Evaluating the Performance of ChatGPT 3.5 and 4.0 on StatPearls Oculoplastic Surgery Text- and Image-Based Exam Questions. Cureus 2024; 16:e73812. [PMID: 39691123 PMCID: PMC11650114 DOI: 10.7759/cureus.73812] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/28/2024] [Accepted: 10/27/2024] [Indexed: 12/19/2024] Open
Abstract
INTRODUCTION The emergence of large language models (LLMs) has led to significant interest in their potential use as medical assistive tools. Prior investigations have analyzed the overall comparative performance of LLM versions within different ophthalmology subspecialties. However, limited investigations have characterized LLM performance on image-based questions, a recent advance in LLM capabilities. The purpose of this study was to evaluate the performance of Chat Generative Pre-Trained Transformers (ChatGPT) versions 3.5 and 4.0 on image-based and text-only questions using oculoplastic subspecialty questions from the StatPearls and OphthoQuestions question banks. METHODS This study utilized 343 text-only questions from StatPearls, 127 image-based questions from StatPearls, and 89 image-based questions from OphthoQuestions, all specific to oculoplastics. The information collected included correctness, the distribution of answers, and whether an additional prompt was necessary. Text-only questions were compared between ChatGPT-3.5 and ChatGPT-4.0, and text-only and multimodal (image-based) questions answered by ChatGPT-4.0 were compared. RESULTS ChatGPT-3.5 answered 56.85% (195/343) of text-only questions correctly, while ChatGPT-4.0 achieved 73.46% (252/343), showing a statistically significant difference in accuracy (p<0.05). The biserial correlation between ChatGPT-3.5 and human performance on the StatPearls question bank was 0.198, with a standard deviation of 0.195. When ChatGPT-3.5 was incorrect, average human correctness was 49.39% (SD 26.27%), and when it was correct, human correctness averaged 57.82% (SD 30.14%), with a t-statistic of 3.57 and a p-value of 0.0004. For ChatGPT-4.0, the biserial correlation was 0.226 (SD 0.213). When ChatGPT-4.0 was incorrect, human correctness averaged 45.49% (SD 24.85%), and when it was correct, human correctness was 57.02% (SD 29.75%), with a t-statistic of 4.28 and a p-value of 0.0006. On image-only questions, ChatGPT-4.0 correctly answered 56.94% (123/216), significantly lower than its performance on text-only questions (p<0.05). DISCUSSION AND CONCLUSION This study shows that ChatGPT-4.0 performs better on the oculoplastic subspecialty than prior versions. However, significant challenges remain regarding accuracy, particularly when integrating image-based prompts. While these models show promise within medical education, further progress must be made regarding LLM reliability, and caution should be used until further advancement is achieved.
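The item-level analysis described, relating a model's right/wrong answer on each question to the share of human test-takers answering it correctly, is a point-biserial correlation. A sketch on invented item data follows.

```python
# Illustrative sketch (invented item data): point-biserial correlation between
# model correctness (0/1) per question and the percentage of human users who
# answered that question correctly.
from scipy.stats import pointbiserialr

model_correct = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]                # model right/wrong per item
human_correct_pct = [62, 48, 71, 55, 40, 66, 52, 58, 74, 45]  # % of humans correct per item

r_pb, p_value = pointbiserialr(model_correct, human_correct_pct)
print(f"point-biserial r = {r_pb:.3f}, p = {p_value:.3f}")
```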
Affiliation(s)
- Gurnoor S Gill
- Medical School, Florida Atlantic University Charles E. Schmidt College of Medicine, Boca Raton, USA
- Jacob Blair
- Ophthalmology, Larkin Community Hospital (LCH) Lake Erie College of Osteopathic Medicine (LECOM), Miami, USA
- Steven Litinsky
- Ophthalmology, Florida Atlantic University Charles E. Schmidt College of Medicine, Boca Raton, USA
44
Leon M, Ruaengsri C, Pelletier G, Bethencourt D, Shibata M, Flores MQ, Shudo Y. Harnessing the Power of ChatGPT in Cardiovascular Medicine: Innovations, Challenges, and Future Directions. J Clin Med 2024; 13:6543. [PMID: 39518681 PMCID: PMC11546989 DOI: 10.3390/jcm13216543] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/30/2024] [Revised: 10/08/2024] [Accepted: 10/29/2024] [Indexed: 11/16/2024] Open
Abstract
Cardiovascular diseases remain the leading cause of morbidity and mortality globally, posing significant challenges to public health. The rapid evolution of artificial intelligence (AI), particularly with large language models such as ChatGPT, has introduced transformative possibilities in cardiovascular medicine. This review examines ChatGPT's broad applications in enhancing clinical decision-making-covering symptom analysis, risk assessment, and differential diagnosis; advancing medical education for both healthcare professionals and patients; and supporting research and academic communication. Key challenges associated with ChatGPT, including potential inaccuracies, ethical considerations, data privacy concerns, and inherent biases, are discussed. Future directions emphasize improving training data quality, developing specialized models, refining AI technology, and establishing regulatory frameworks to enhance ChatGPT's clinical utility and mitigate associated risks. As cardiovascular medicine embraces AI, ChatGPT stands out as a powerful tool with substantial potential to improve therapeutic outcomes, elevate care quality, and advance research innovation. Fully understanding and harnessing this potential is essential for the future of cardiovascular health.
Affiliation(s)
- Yasuhiro Shudo
- Department of Cardiothoracic Surgery, Stanford University School of Medicine, 300 Pasteur Drive, Falk CVRB, Stanford, CA 94305, USA; (C.R.); (G.P.); (D.B.); (M.Q.F.)
45
Aydin S, Karabacak M, Vlachos V, Margetis K. Large language models in patient education: a scoping review of applications in medicine. Front Med (Lausanne) 2024; 11:1477898. [PMID: 39534227 PMCID: PMC11554522 DOI: 10.3389/fmed.2024.1477898] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/08/2024] [Accepted: 10/03/2024] [Indexed: 11/16/2024] Open
Abstract
Introduction Large Language Models (LLMs) are sophisticated algorithms that analyze and generate vast amounts of textual data, mimicking human communication. Notable LLMs include GPT-4o by Open AI, Claude 3.5 Sonnet by Anthropic, and Gemini by Google. This scoping review aims to synthesize the current applications and potential uses of LLMs in patient education and engagement. Materials and methods Following the PRISMA-ScR checklist and methodologies by Arksey, O'Malley, and Levac, we conducted a scoping review. We searched PubMed in June 2024, using keywords and MeSH terms related to LLMs and patient education. Two authors conducted the initial screening, and discrepancies were resolved by consensus. We employed thematic analysis to address our primary research question. Results The review identified 201 studies, predominantly from the United States (58.2%). Six themes emerged: generating patient education materials, interpreting medical information, providing lifestyle recommendations, supporting customized medication use, offering perioperative care instructions, and optimizing doctor-patient interaction. LLMs were found to provide accurate responses to patient queries, enhance existing educational materials, and translate medical information into patient-friendly language. However, challenges such as readability, accuracy, and potential biases were noted. Discussion LLMs demonstrate significant potential in patient education and engagement by creating accessible educational materials, interpreting complex medical information, and enhancing communication between patients and healthcare providers. Nonetheless, issues related to the accuracy and readability of LLM-generated content, as well as ethical concerns, require further research and development. Future studies should focus on improving LLMs and ensuring content reliability while addressing ethical considerations.
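One of the recurring challenges noted here, the readability of LLM-generated patient materials, is usually checked with standard formulas. Below is a rough sketch of the Flesch Reading Ease score with a crude syllable heuristic; dedicated packages implement more careful rules.

```python
# Rough sketch: Flesch Reading Ease for a snippet of patient-facing text.
# The syllable counter is a crude heuristic, so scores are approximate.
import re

def count_syllables(word: str) -> int:
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text: str) -> float:
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z]+", text)
    n_words = max(1, len(words))
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (n_words / sentences) - 84.6 * (syllables / n_words)

sample = "Take one tablet by mouth every morning. Call your doctor if you feel dizzy."
print(round(flesch_reading_ease(sample), 1))
```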
Affiliation(s)
- Serhat Aydin
- School of Medicine, Koç University, Istanbul, Türkiye
- Mert Karabacak
- Department of Neurosurgery, Mount Sinai Health System, New York, NY, United States
- Victoria Vlachos
- College of Human Ecology, Cornell University, Ithaca, NY, United States
46
Biard M, Detcheverry FE, Betzner W, Becker S, Grewal KS, Azab S, Bloniasz PF, Mazerolle EL, Phelps J, Smith EE, Badhwar A. Supporting decision-making for individuals living with dementia and their care partners with knowledge translation: an umbrella review. MEDRXIV : THE PREPRINT SERVER FOR HEALTH SCIENCES 2024:2024.09.17.24312581. [PMID: 39371149 PMCID: PMC11451719 DOI: 10.1101/2024.09.17.24312581] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 10/08/2024]
Abstract
Living with dementia requires decision-making about numerous topics, ranging from daily activities to advance care planning (ACP). Both individuals living with dementia and their care partners require informed support for decision-making. We conducted an umbrella review to assess knowledge translation (KT) interventions supporting decision-making for individuals living with dementia and their informal care partners. Four databases were searched using 50 different search terms, identifying 22 reviews presenting 32 KT interventions. The most common KT decision topic was ACP (N=21), which includes advance care directives, feeding options, and placement in long-term care. The majority of KT interventions targeted care partners only (N=16) or both care partners and individuals living with dementia (N=13), with fewer interventions (N=3) targeting only individuals living with dementia. Overall, our umbrella review offers insights into the beneficial impacts of KT interventions, such as increased knowledge and confidence and decreased decisional conflict.
47
Sridharan K, Sivaramakrishnan G. Unlocking the potential of advanced large language models in medication review and reconciliation: A proof-of-concept investigation. EXPLORATORY RESEARCH IN CLINICAL AND SOCIAL PHARMACY 2024; 15:100492. [PMID: 39257533 PMCID: PMC11385755 DOI: 10.1016/j.rcsop.2024.100492] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/17/2024] [Revised: 08/08/2024] [Accepted: 08/13/2024] [Indexed: 09/12/2024] Open
Abstract
Background Medication review and reconciliation is essential for optimizing drug therapy and minimizing medication errors. Large language models (LLMs) have recently been shown to have numerous potential applications in healthcare owing to their capacity for deductive, abductive, and logical reasoning. The present study assessed the abilities of LLMs in medication review and medication reconciliation processes. Methods Four LLMs were prompted with appropriate queries related to dosing regimen errors, drug-drug interactions, therapeutic drug monitoring, and genomics-based decision-making. The veracity of the LLM outputs was verified against validated sources using pre-validated criteria (accuracy, relevancy, risk management, hallucination mitigation, and citations and guidelines). The impact of erroneous responses on patient safety was categorized as either major or minor. Results In the assessment of the four LLMs regarding dosing regimen errors, drug-drug interactions, and suggestions for dosing regimen adjustments based on therapeutic drug monitoring and genomics-based individualization of drug therapy, responses were generally consistent across prompts, with no clear pattern in response quality among the LLMs. For identification of dosage regimen errors, ChatGPT performed well overall, except for the query related to simvastatin. In terms of potential drug-drug interactions, all LLMs recognized interactions with warfarin but missed the interaction between metoprolol and verapamil. Regarding dosage modifications based on therapeutic drug monitoring, Claude-Instant provided appropriate suggestions for two scenarios and nearly appropriate suggestions for the other two. Similarly, for genomics-based decision-making, Claude-Instant offered satisfactory responses for four scenarios, followed by Gemini for three. Notably, Gemini stood out by providing references to guidelines or citations even without prompting, underscoring the accuracy and reliability of its responses. Minor impacts were noted in identifying appropriate dosing regimens and therapeutic drug monitoring, while major impacts were found in identifying drug interactions and making pharmacogenomic-based therapeutic decisions. Conclusion Advanced LLMs hold significant promise in revolutionizing the medication review and reconciliation process in healthcare. Diverse impacts on patient safety were observed. Integrating and validating LLMs within electronic health records and prescription systems is essential to harness their full potential and enhance patient safety and care quality.
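The drug-drug interaction queries above boil down to checking a medication list against known interacting pairs. A toy sketch of that check follows; the pair table is illustrative and far from exhaustive, and it is not a clinical decision tool.

```python
# Toy sketch: screening a medication list against a small, illustrative table
# of interacting pairs. Not exhaustive and not a clinical decision tool.
KNOWN_INTERACTIONS = {
    frozenset({"warfarin", "aspirin"}): "increased bleeding risk",
    frozenset({"metoprolol", "verapamil"}): "additive bradycardia / AV block risk",
    frozenset({"simvastatin", "clarithromycin"}): "raised statin exposure, myopathy risk",
}

def check_interactions(medications):
    meds = [m.lower() for m in medications]
    hits = []
    for i, first in enumerate(meds):
        for second in meds[i + 1:]:
            note = KNOWN_INTERACTIONS.get(frozenset({first, second}))
            if note:
                hits.append((first, second, note))
    return hits

print(check_interactions(["Warfarin", "Aspirin", "Metoprolol", "Verapamil"]))
```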
Affiliation(s)
- Kannan Sridharan
- Department of Pharmacology & Therapeutics, College of Medicine & Medical Sciences, Arabian Gulf University, Manama, Bahrain
48
Gill GS, Tsai J, Moxam J, Sanghvi HA, Gupta S. Comparison of Gemini Advanced and ChatGPT 4.0's Performances on the Ophthalmology Resident Ophthalmic Knowledge Assessment Program (OKAP) Examination Review Question Banks. Cureus 2024; 16:e69612. [PMID: 39421095 PMCID: PMC11486483 DOI: 10.7759/cureus.69612] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 09/17/2024] [Indexed: 10/19/2024] Open
Abstract
Background With advancements in natural language processing, tools such as Chat Generative Pre-Trained Transformers (ChatGPT) version 4.0 and Google Bard's Gemini Advanced are being increasingly evaluated for their potential in various medical applications. The objective of this study was to systematically assess the performance of these large language models (LLMs) on both image-based and non-image-based questions within the specialized field of ophthalmology. To assess the accuracy and performance of ChatGPT and Gemini Advanced, we used a review question bank for the Ophthalmic Knowledge Assessment Program (OKAP) used nationally by ophthalmology residents to prepare for the ophthalmology board examination. Methodology A total of 260 randomly generated multiple-choice questions from the OphthoQuestions question bank were run through ChatGPT and Gemini Advanced. A simulated 260-question OKAP examination was created at random from the bank. Question-specific data were analyzed, including overall percent correct, subspecialty accuracy, whether the question was "high yield," difficulty (1-4), and question type (e.g., image, text). To compare the performance of ChatGPT-4.0 and Gemini Advanced by question difficulty, we used the standard deviation of user answer choices to estimate difficulty. Statistical analysis was conducted in Google Sheets using two-tailed t-tests with unequal variance to compare the performance of ChatGPT-4.0 and Google's Gemini Advanced across various question types, subspecialties, and difficulty levels. Results In total, 259 of the 260 questions were included in the study, as one question used a video that ChatGPT could not interpret as of May 1, 2024. For text-only questions, ChatGPT-4.0 correctly answered 57.14% (148/259, p < 0.018), and Gemini Advanced correctly answered 46.72% (121/259, p < 0.018). Both models answered most questions without a prompt and would have received a below-average score on the OKAP. Moreover, 27 questions required a secondary prompt with ChatGPT-4.0 compared with 67 questions with Gemini Advanced. ChatGPT-4.0 answered 68.99% of easier questions (<2 on a scale from 1-4) and 44.96% of harder questions (>2 on a scale from 1-4) correctly. Gemini Advanced answered 49.61% of easier questions (<2 on a scale from 1-4) and 44.19% of harder questions (>2 on a scale from 1-4) correctly. There was a statistically significant difference in accuracy between ChatGPT-4.0 and Gemini Advanced for easy (p < 0.0015) but not for hard (p < 0.55) questions. For image-only questions, ChatGPT-4.0 correctly answered 39.58% (19/48, p < 0.013), and Gemini Advanced correctly answered 33.33% (16/48, p < 0.022), a statistically nonsignificant difference in accuracy between the two models (p < 0.530). A comparison between text-only and image-based questions demonstrated a statistically significant difference in accuracy for both ChatGPT-4.0 (p < 0.013) and Gemini Advanced (p < 0.022). Conclusions This study provides evidence that ChatGPT-4.0 performs better on OKAP-style examinations than Gemini Advanced in the context of ophthalmic multiple-choice questions. This may indicate an opportunity for greater use of ChatGPT in ophthalmic medical education. While these tools show promise within medical education, caution should be used, as a more detailed evaluation of reliability is needed.
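The comparisons reported here rest on two-tailed t-tests with unequal variances over per-question correctness. A sketch with simulated answer vectors (not the study's data) is below.

```python
# Illustrative sketch: per-question correctness (1/0) for two models compared
# with a two-tailed t-test assuming unequal variances (Welch's test).
# The vectors are simulated, not the study's answer logs.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
chatgpt4 = rng.binomial(1, 0.57, size=259)  # ~57% correct over 259 questions
gemini = rng.binomial(1, 0.47, size=259)    # ~47% correct

t_stat, p_value = ttest_ind(chatgpt4, gemini, equal_var=False)
print(f"Welch t = {t_stat:.2f}, two-tailed p = {p_value:.4f}")
```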
Affiliation(s)
- Gurnoor S Gill
- Medical School, Florida Atlantic University Charles E. Schmidt College of Medicine, Boca Raton, USA
- Joby Tsai
- Ophthalmology, Broward Health, Fort Lauderdale, USA
- Jillene Moxam
- School of Medicine, University of Florida, Gainesville, USA
- Department of Technology and Clinical Trials, Advanced Research, Deerfield Beach, USA
- Harshal A Sanghvi
- Department of Biomedical Sciences, Florida Atlantic University, Boca Raton, USA
- Department of Technology and Clinical Trials, Advanced Research, Deerfield Beach, USA
49
Hwai H, Ho YJ, Wang CH, Huang CH. Large language model application in emergency medicine and critical care. J Formos Med Assoc 2024:S0929-6646(24)00400-5. [PMID: 39198112 DOI: 10.1016/j.jfma.2024.08.032] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/15/2024] [Revised: 08/13/2024] [Accepted: 08/23/2024] [Indexed: 09/01/2024] Open
Abstract
In the rapidly evolving healthcare landscape, artificial intelligence (AI), particularly the large language models (LLMs), like OpenAI's Chat Generative Pretrained Transformer (ChatGPT), has shown transformative potential in emergency medicine and critical care. This review article highlights the advancement and applications of ChatGPT, from diagnostic assistance to clinical documentation and patient communication, demonstrating its ability to perform comparably to human professionals in medical examinations. ChatGPT could assist clinical decision-making and medication selection in critical care, showcasing its potential to optimize patient care management. However, integrating LLMs into healthcare raises legal, ethical, and privacy concerns, including data protection and the necessity for informed consent. Finally, we addressed the challenges related to the accuracy of LLMs, such as the risk of providing incorrect medical advice. These concerns underscore the importance of ongoing research and regulation to ensure their ethical and practical use in healthcare.
Affiliation(s)
- Haw Hwai
- Department of Emergency Medicine, National Taiwan University Hospital, National Taiwan University Medical College, Taipei, Taiwan.
- Yi-Ju Ho
- Department of Emergency Medicine, National Taiwan University Hospital, National Taiwan University Medical College, Taipei, Taiwan.
- Chih-Hung Wang
- Department of Emergency Medicine, National Taiwan University Hospital, National Taiwan University Medical College, Taipei, Taiwan.
- Chien-Hua Huang
- Department of Emergency Medicine, National Taiwan University Hospital, National Taiwan University Medical College, Taipei, Taiwan.
50
Gan W, Ouyang J, Li H, Xue Z, Zhang Y, Dong Q, Huang J, Zheng X, Zhang Y. Integrating ChatGPT in Orthopedic Education for Medical Undergraduates: Randomized Controlled Trial. J Med Internet Res 2024; 26:e57037. [PMID: 39163598 PMCID: PMC11372336 DOI: 10.2196/57037] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/02/2024] [Revised: 06/10/2024] [Accepted: 06/27/2024] [Indexed: 08/22/2024] Open
Abstract
BACKGROUND ChatGPT is a natural language processing model developed by OpenAI, which can be iteratively updated and optimized to accommodate the changing and complex requirements of human verbal communication. OBJECTIVE The study aimed to evaluate ChatGPT's accuracy in answering orthopedics-related multiple-choice questions (MCQs) and assess its short-term effects as a learning aid through a randomized controlled trial. In addition, long-term effects on student performance in other subjects were measured using final examination results. METHODS We first evaluated ChatGPT's accuracy in answering MCQs pertaining to orthopedics across various question formats. Then, 129 undergraduate medical students participated in a randomized controlled study in which the ChatGPT group used ChatGPT as a learning tool, while the control group was prohibited from using artificial intelligence software to support learning. Following a 2-week intervention, the 2 groups' understanding of orthopedics was assessed by an orthopedics test, and variations in the 2 groups' performance in other disciplines were noted through a follow-up at the end of the semester. RESULTS ChatGPT-4.0 answered 1051 orthopedics-related MCQs with a 70.60% (742/1051) accuracy rate, including 71.8% (237/330) accuracy for A1 MCQs, 73.7% (330/448) accuracy for A2 MCQs, 70.2% (92/131) accuracy for A3/4 MCQs, and 58.5% (83/142) accuracy for case analysis MCQs. As of April 7, 2023, a total of 129 individuals participated in the experiment. However, 19 individuals withdrew from the experiment at various phases; thus, as of July 1, 2023, a total of 110 individuals accomplished the trial and completed all follow-up work. After we intervened in the learning style of the students in the short term, the ChatGPT group answered more questions correctly than the control group (ChatGPT group: mean 141.20, SD 26.68; control group: mean 130.80, SD 25.56; P=.04) in the orthopedics test, particularly on A1 (ChatGPT group: mean 46.57, SD 8.52; control group: mean 42.18, SD 9.43; P=.01), A2 (ChatGPT group: mean 60.59, SD 10.58; control group: mean 56.66, SD 9.91; P=.047), and A3/4 MCQs (ChatGPT group: mean 19.57, SD 5.48; control group: mean 16.46, SD 4.58; P=.002). At the end of the semester, we found that the ChatGPT group performed better on final examinations in surgery (ChatGPT group: mean 76.54, SD 9.79; control group: mean 72.54, SD 8.11; P=.02) and obstetrics and gynecology (ChatGPT group: mean 75.98, SD 8.94; control group: mean 72.54, SD 8.66; P=.04) than the control group. CONCLUSIONS ChatGPT answers orthopedics-related MCQs accurately, and students using it excel in both short-term and long-term assessments. Our findings strongly support ChatGPT's integration into medical education, enhancing contemporary instructional methods. TRIAL REGISTRATION Chinese Clinical Trial Registry Chictr2300071774; https://www.chictr.org.cn/hvshowproject.html ?id=225740&v=1.0.
Affiliation(s)
- Wenyi Gan
- The First Clinical Medical College of Jinan University, The First Affiliated Hospital of Jinan University, Guangzhou, China
- Jianfeng Ouyang
- Department of Joint Surgery and Sports Medicine, Zhuhai People's Hospital (Zhuhai Hospital Affiliated With Jinan University), Zhuhai, Guangdong, China
- Hua Li
- Department of Orthopaedics, Beijing Jishuitan Hospital, Beijing, China
- Zhaowen Xue
- The First Clinical Medical College of Jinan University, The First Affiliated Hospital of Jinan University, Guangzhou, China
- Yiming Zhang
- The First Clinical Medical College of Jinan University, The First Affiliated Hospital of Jinan University, Guangzhou, China
- Qiu Dong
- The First Clinical Medical College of Jinan University, The First Affiliated Hospital of Jinan University, Guangzhou, China
- Jiadong Huang
- Jinan University-University of Birmingham Joint Institute, Jinan University, Guangzhou, China
- Xiaofei Zheng
- The First Clinical Medical College of Jinan University, The First Affiliated Hospital of Jinan University, Guangzhou, China
- Yiyi Zhang
- The First Clinical Medical College of Jinan University, The First Affiliated Hospital of Jinan University, Guangzhou, China