1. Wang L, Bhanushali T, Huang Z, Yang J, Badami S, Hightow-Weidman L. Evaluating Generative AI in Mental Health: Systematic Review of Capabilities and Limitations. JMIR Ment Health 2025; 12:e70014. [PMID: 40373033] [DOI: 10.2196/70014]
Abstract
Background The global shortage of mental health professionals, exacerbated by increasing mental health needs post-COVID-19, has stimulated growing interest in leveraging large language models to address these challenges. Objectives This systematic review aims to evaluate the current capabilities of generative artificial intelligence (GenAI) models in the context of mental health applications. Methods A comprehensive search across 5 databases yielded 1046 references, of which 8 studies met the inclusion criteria: original research with an experimental design (eg, Turing tests, sociocognitive tasks, trials, or qualitative methods); a focus on GenAI models; and explicit measurement of sociocognitive abilities (eg, empathy and emotional awareness), mental health outcomes, and user experience (eg, perceived trust and empathy). Results The studies, published between 2023 and 2024, primarily evaluated models such as ChatGPT-3.5 and 4.0, Bard, and Claude on tasks such as psychoeducation, diagnosis, emotional awareness, and clinical intervention. Most studies used zero-shot prompting and human evaluators to assess the AI responses, applying standardized rating scales or qualitative analysis. However, these methods were often insufficient to capture the full complexity of GenAI capabilities. The reliance on single-shot prompting, limited comparisons, and task-based assessments isolated from context may oversimplify GenAI's abilities and overlook the nuances of human-artificial intelligence interaction, especially in clinical applications that require contextual reasoning and cultural sensitivity. The findings suggest that while GenAI models demonstrate strengths in psychoeducation and emotional awareness, their diagnostic accuracy, cultural competence, and ability to engage users emotionally remain limited. Users frequently reported concerns about trustworthiness, accuracy, and the lack of emotional engagement. Conclusions Future research could use more sophisticated evaluation methods, such as few-shot and chain-of-thought prompting, to uncover GenAI's potential more fully. Longitudinal studies and broader comparisons with human benchmarks are needed to explore the effects of GenAI-integrated mental health care.
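The review's methodological critique hinges on the difference between zero-shot (single-shot) prompting and richer strategies such as few-shot and chain-of-thought prompting. The short sketch below illustrates that distinction on a hypothetical clinical vignette; the vignette, the prompt wording, and the variable names are illustrative assumptions, not prompts used in any of the reviewed studies.

```python
# Illustrative contrast between the prompting strategies named in the abstract above.
# The vignette and prompt wording are hypothetical and for demonstration only.
vignette = ("A 19-year-old student reports six weeks of low mood, poor sleep, "
            "loss of interest in hobbies, and difficulty concentrating in class.")

# Zero-shot: the model gets the task with no examples and no reasoning scaffold.
zero_shot = f"{vignette}\n\nWhat is the most likely diagnosis?"

# Few-shot: one or more worked examples precede the new case.
few_shot = (
    "Case: 'A patient reports recurrent panic attacks and avoids crowded places.'\n"
    "Assessment: panic disorder with agoraphobia.\n\n"
    f"Case: '{vignette}'\nAssessment:"
)

# Chain-of-thought: the prompt asks for explicit intermediate reasoning steps.
chain_of_thought = (
    f"{vignette}\n\n"
    "Think step by step: list the key symptoms, note their duration, "
    "compare them with standard diagnostic criteria for a depressive episode, "
    "and only then state the most likely diagnosis with your reasoning."
)

for name, prompt in [("zero-shot", zero_shot),
                     ("few-shot", few_shot),
                     ("chain-of-thought", chain_of_thought)]:
    print(f"--- {name} ---\n{prompt}\n")
```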
Affiliation(s)
- Liying Wang: Institute on Digital Health and Innovation, College of Nursing, Florida State University, Tallahassee, FL, United States; Center of Population Sciences for Health Equity, College of Nursing, Florida State University, Tallahassee, FL, United States
- Tanmay Bhanushali: College of Arts and Sciences, University of Washington, Seattle, WA, United States
- Zhuoran Huang: Khoury College of Computer Sciences, Northeastern University, Boston, MA, United States
- Jingyi Yang: Teachers College, Columbia University, New York, NY, United States
- Sukriti Badami: College of Arts and Sciences, University of Washington, Seattle, WA, United States
- Lisa Hightow-Weidman: Institute on Digital Health and Innovation, College of Nursing, Florida State University, Tallahassee, FL, United States
2. Jin Y, Liu J, Li P, Wang B, Yan Y, Zhang H, Ni C, Wang J, Li Y, Bu Y, Wang Y. The Applications of Large Language Models in Mental Health: Scoping Review. J Med Internet Res 2025; 27:e69284. [PMID: 40324177] [DOI: 10.2196/69284]
Abstract
BACKGROUND Mental health problems are an increasingly prevalent public health issue globally. There is an urgent need in mental health for efficient detection methods, effective treatments, affordable privacy-focused health care solutions, and increased access to specialized psychiatrists. The emergence and rapid development of large language models (LLMs) have shown the potential to address these mental health demands. However, a comprehensive review summarizing the application areas, processes, and performance comparisons of LLMs in mental health has been lacking until now. OBJECTIVE This review aimed to summarize the applications of LLMs in mental health, including trends, application areas, performance comparisons, challenges, and prospective future directions. METHODS A scoping review was conducted to map the landscape of LLMs' applications in mental health, including trends, application areas, comparative performance, and future trajectories. We searched 7 electronic databases (Web of Science, PubMed, Cochrane Library, IEEE Xplore, Weipu, CNKI, and Wanfang) from January 1, 2019, to August 31, 2024. Studies eligible for inclusion were peer-reviewed articles focused on LLMs' applications in mental health. Studies were excluded if they (1) were not peer-reviewed or did not focus on mental health or mental disorders or (2) did not use LLMs; studies that used only natural language processing or long short-term memory models were also excluded. Relevant information on application details and performance metrics was extracted during the data charting of eligible articles. RESULTS A total of 95 articles using LLMs for mental health tasks were included from 4859 retrieved studies. The applications were categorized into 3 key areas: screening or detection of mental disorders (67/95, 71%), supporting clinical treatments and interventions (31/95, 33%), and assisting in mental health counseling and education (11/95, 12%). Most studies used LLMs for depression detection and classification (33/95, 35%), clinical treatment support and intervention (14/95, 15%), and suicide risk prediction (12/95, 13%). Compared with nontransformer models and humans, LLMs demonstrate stronger capabilities in acquiring and analyzing information and in efficiently generating natural language responses. Different LLM families also show distinct advantages and disadvantages in addressing mental health tasks. CONCLUSIONS This scoping review synthesizes the applications, processes, performance, and challenges of LLMs in the mental health field. These findings highlight the substantial potential of LLMs to augment mental health research, diagnostics, and intervention strategies, underscoring the imperative for ongoing development and ethical deliberation in clinical settings.
Affiliation(s)
- Yu Jin: Department of Statistics, Faculty of Arts and Sciences, Beijing Normal University, Beijing, China
- Jiayi Liu: Department of Statistics, Faculty of Arts and Sciences, Beijing Normal University, Beijing, China
- Pan Li: Department of Statistics, Faculty of Arts and Sciences, Beijing Normal University, Beijing, China
- Baosen Wang: Department of Statistics, Faculty of Arts and Sciences, Beijing Normal University, Beijing, China
- Yangxinyu Yan: School of Psychology, Center for Studies of Psychological Application, and Guangdong Key Laboratory of Mental Health and Cognitive Science, Key Laboratory of Brain, Cognition and Education Sciences, Ministry of Education, South China Normal University, Guangzhou, Guangdong, China
- Huilin Zhang: School of Statistics, Beijing Normal University, Beijing, China
- Chenhao Ni: School of Statistics, Beijing Normal University, Beijing, China
- Jing Wang: Faculty of Computer Science, Guangdong Polytechnic Normal University, Guangzhou, Guangdong, China
- Yi Li: The People's Hospital of Pingbian County, Honghe, Yunnan, China
- Yajun Bu: The People's Hospital of Pingbian County, Honghe, Yunnan, China
- Yuanyuan Wang: School of Psychology, Center for Studies of Psychological Application, and Guangdong Key Laboratory of Mental Health and Cognitive Science, Key Laboratory of Brain, Cognition and Education Sciences, Ministry of Education, South China Normal University, Guangzhou, Guangdong, China
3. Levkovich I, Rabin E, Farraj RH, Elyoseph Z. Attributional patterns toward students with and without learning disabilities: Artificial intelligence models vs. trainee teachers. Res Dev Disabil 2025; 160:104970. [PMID: 40090118] [DOI: 10.1016/j.ridd.2025.104970]
Abstract
This study explored differences in the attributional patterns of four advanced artificial intelligence (AI) large language models (LLMs): ChatGPT-3.5, ChatGPT-4, Claude, and Gemini, focusing on feedback, frustration, sympathy, and expectations of future failure among students with and without learning disabilities (LD). These findings were compared with responses from a sample of Australian and Chinese trainee teachers, comprising individuals nearing qualification with varied demographic and educational backgrounds. Eight vignettes depicting students with varying abilities and efforts were evaluated by the LLMs ten times each, resulting in 320 evaluations, with trainee teachers providing comparable ratings. For LD students, the LLMs exhibited lower frustration and higher sympathy than trainee teachers, while for non-LD students, LLMs similarly showed lower frustration, with ChatGPT-3.5 aligning closely with Chinese teachers and ChatGPT-4 demonstrating more sympathy than both teacher groups. Notably, LLMs expressed lower expectations of future academic failure for both LD and non-LD students compared to trainee teachers. Regarding feedback, the findings reflect ratings of the qualitative nature of the feedback that LLMs and teachers would provide, rather than actual feedback text. The LLMs, particularly ChatGPT-3.5 and Gemini, were rated as providing more negative feedback than trainee teachers, while ChatGPT-4 provided more positive ratings for both LD and non-LD students, aligning with Chinese teachers in some cases. These findings suggest that LLMs may promote a positive and inclusive outlook for LD students by exhibiting lower judgmental tendencies and higher optimism. However, their tendency to rate feedback more negatively than trainee teachers highlights the need to recalibrate AI tools to better align with cultural and emotional nuances.
Affiliation(s)
- Inbar Levkovich: Faculty of Education, Tel Hai College, Upper Galilee, Israel
- Eyal Rabin: Department of Psychology and Education, The Open University of Israel, Israel
- Zohar Elyoseph: University of Haifa, Mount Carmel, Haifa, Israel; Department of Brain Sciences, Faculty of Medicine, Imperial College London, UK
4. Hadar-Shoval D, Lvovsky M, Asraf K, Shimoni Y, Elyoseph Z. The Feasibility of Large Language Models in Verbal Comprehension Assessment: Mixed Methods Feasibility Study. JMIR Form Res 2025; 9:e68347. [PMID: 39993720] [PMCID: PMC11894350] [DOI: 10.2196/68347]
Abstract
BACKGROUND Cognitive assessment is an important component of applied psychology, but limited access and high costs make these evaluations challenging. OBJECTIVE This study aimed to examine the feasibility of using large language models (LLMs) to create personalized artificial intelligence-based verbal comprehension tests (AI-BVCTs) for assessing verbal intelligence, in contrast with traditional assessment methods based on standardized norms. METHODS We used a within-participants design, comparing scores obtained from AI-BVCTs with those from the Wechsler Adult Intelligence Scale (WAIS-III) verbal comprehension index (VCI). In total, 8 Hebrew-speaking participants completed both the VCI and AI-BVCT, the latter being generated using the LLM Claude. RESULTS The concordance correlation coefficient (CCC) demonstrated strong agreement between AI-BVCT and VCI scores (Claude: CCC=.75, 90% CI 0.266-0.933; GPT-4: CCC=.73, 90% CI 0.170-0.935). Pearson correlations further supported these findings, showing strong associations between VCI and AI-BVCT scores (Claude: r=.84, P<.001; GPT-4: r=.77, P=.02). No statistically significant differences were found between AI-BVCT and VCI scores (P>.05). CONCLUSIONS These findings support the potential of LLMs to assess verbal intelligence. The study attests to the promise of AI-based cognitive tests in increasing the accessibility and affordability of assessment processes, enabling personalized testing. The research also raises ethical concerns regarding privacy and overreliance on AI in clinical work. Further research with larger and more diverse samples is needed to establish the validity and reliability of this approach and develop more accurate scoring procedures.
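The agreement statistics reported above (the concordance correlation coefficient alongside Pearson's r) can be reproduced with a few lines of code. The sketch below is a minimal illustration using invented paired scores for eight participants; only the formulas mirror the statistics named in the abstract, and none of the numbers are the study's data.

```python
import numpy as np
from scipy import stats

def lins_ccc(x, y):
    """Lin's concordance correlation coefficient between two paired score vectors."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    covariance = np.cov(x, y, ddof=1)[0, 1]
    return 2 * covariance / (x.var(ddof=1) + y.var(ddof=1) + (x.mean() - y.mean()) ** 2)

# Hypothetical paired scores for 8 participants (WAIS-III VCI vs. an AI-generated test);
# these values are invented for illustration only.
vci = [102, 110, 95, 121, 108, 99, 115, 105]
ai_bvct = [100, 113, 97, 118, 112, 96, 117, 103]

ccc = lins_ccc(vci, ai_bvct)
r, p = stats.pearsonr(vci, ai_bvct)
print(f"CCC = {ccc:.2f}, Pearson r = {r:.2f} (P = {p:.3f})")
```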
Affiliation(s)
- Dorit Hadar-Shoval: Department of Psychology, Max Stern Academic College of Emek Yezreel, Afula, Israel
- Maya Lvovsky: Department of Psychology, Max Stern Academic College of Emek Yezreel, Afula, Israel
- Kfir Asraf: Department of Psychology, Max Stern Academic College of Emek Yezreel, Afula, Israel
- Yoav Shimoni: School of Psychology, University of Haifa, Haifa, Israel; Department of Education and Psychology, Open University of Israel, Raanana, Israel
- Zohar Elyoseph: Department of Brain Sciences, Faculty of Medicine, Imperial College London, London, United Kingdom; School of Counseling and Human Development, Faculty of Education, University of Haifa, Haifa, Israel
5. Haber Y, Hadar Shoval D, Levkovich I, Yinon D, Gigi K, Pen O, Angert T, Elyoseph Z. The externalization of internal experiences in psychotherapy through generative artificial intelligence: a theoretical, clinical, and ethical analysis. Front Digit Health 2025; 7:1512273. [PMID: 39968063] [PMCID: PMC11832678] [DOI: 10.3389/fdgth.2025.1512273]
Abstract
Introduction Externalization techniques are well established in psychotherapy approaches, including narrative therapy and cognitive behavioral therapy. These methods elicit internal experiences such as emotions and make them tangible through external representations. Recent advances in generative artificial intelligence (GenAI), specifically large language models (LLMs), present new possibilities for therapeutic interventions; however, their integration into core psychotherapy practices remains largely unexplored. This study aimed to examine the clinical, ethical, and theoretical implications of integrating GenAI into the therapeutic space through a proof-of-concept (POC) of AI-driven externalization techniques, while emphasizing the essential role of the human therapist. Methods To this end, we developed two customized GPT agents: VIVI (visual externalization), which uses DALL-E 3 to create images reflecting patients' internal experiences (e.g., depression or hope), and DIVI (dialogic role-play-based externalization), which simulates conversations with aspects of patients' internal content. These tools were implemented and evaluated through a clinical case study under professional psychological guidance. Results The integration of VIVI and DIVI demonstrated that GenAI can serve as an "artificial third", creating a Winnicottian playful space that enhances, rather than supplants, the dyadic therapist-patient relationship. The tools successfully externalized complex internal dynamics, offering new therapeutic avenues, while also revealing challenges such as empathic failures and cultural biases. Discussion These findings highlight both the promise and the ethical complexities of AI-enhanced therapy, including concerns about data security, representation accuracy, and the balance of clinical authority. To address these challenges, we propose the SAFE-AI protocol, offering clinicians structured guidelines for responsible AI integration in therapy. Future research should systematically evaluate the generalizability, efficacy, and ethical implications of these tools across diverse populations and therapeutic contexts.
Affiliation(s)
- Yuval Haber: The Program of Hermeneutics and Cultural Studies, Interdisciplinary Studies Unit, Bar-Ilan University, Jerusalem, Israel
- Dorit Hadar Shoval: Department of Psychology, Max Stern Academic College of Emek Yezreel, Yezreel Valley, Israel
- Inbar Levkovich: Faculty of Education, Tel-Hai Academic College, Kiryat Shmona, Israel
- Dror Yinon: The Program of Hermeneutics and Cultural Studies, Interdisciplinary Studies Unit, Bar-Ilan University, Jerusalem, Israel
- Karny Gigi: Department of Counseling and Human Development, Faculty of Education, University of Haifa, Haifa, Israel
- Oori Pen: Department of Counseling and Human Development, Faculty of Education, University of Haifa, Haifa, Israel
- Tal Angert: Department of Counseling and Human Development, Faculty of Education, University of Haifa, Haifa, Israel
- Zohar Elyoseph: Department of Counseling and Human Development, Faculty of Education, University of Haifa, Haifa, Israel; Department of Brain Sciences, Faculty of Medicine, Imperial College London, London, United Kingdom
6. Kolding S, Lundin RM, Hansen L, Østergaard SD. Use of generative artificial intelligence (AI) in psychiatry and mental health care: a systematic review. Acta Neuropsychiatr 2024; 37:e37. [PMID: 39523628] [DOI: 10.1017/neu.2024.50]
Abstract
OBJECTIVES Tools based on generative artificial intelligence (AI) such as ChatGPT have the potential to transform modern society, including the field of medicine. Due to the prominent role of language in psychiatry, e.g., for diagnostic assessment and psychotherapy, these tools may be particularly useful within this medical field. Therefore, the aim of this study was to systematically review the literature on generative AI applications in psychiatry and mental health. METHODS We conducted a systematic review following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses guidelines. The search was conducted across three databases, and the resulting articles were screened independently by two researchers. The content, themes, and findings of the articles were qualitatively assessed. RESULTS The search and screening process resulted in the inclusion of 40 studies. The median year of publication was 2023. The articles mainly covered mental health and well-being in general, with less emphasis on specific mental disorders; among the latter, substance use disorder was the most prevalent. The majority of studies were conducted as prompt experiments, with the remaining studies comprising surveys, pilot studies, and case reports. Most studies focused on models that generate language, ChatGPT in particular. CONCLUSIONS Generative AI in psychiatry and mental health is a nascent but quickly expanding field. The literature mainly focuses on applications of ChatGPT and finds that generative AI performs well but is limited by significant safety and ethical concerns. Future research should strive to enhance transparency of methods, use experimental designs, ensure clinical relevance, and involve users/patients in the design phase.
Affiliation(s)
- Sara Kolding: Department of Clinical Medicine, Aarhus University, Aarhus, Denmark; Department of Affective Disorders, Aarhus University Hospital - Psychiatry, Aarhus, Denmark; Center for Humanities Computing, Aarhus University, Aarhus, Denmark
- Robert M Lundin: Deakin University, Institute for Mental and Physical Health and Clinical Translation (IMPACT), Geelong, VIC, Australia; Mildura Base Public Hospital, Mental Health Services, Alcohol and Other Drugs Integrated Treatment Team, Mildura, VIC, Australia; Barwon Health, Change to Improve Mental Health (CHIME), Mental Health Drugs and Alcohol Services, Geelong, VIC, Australia
- Lasse Hansen: Department of Clinical Medicine, Aarhus University, Aarhus, Denmark; Department of Affective Disorders, Aarhus University Hospital - Psychiatry, Aarhus, Denmark; Center for Humanities Computing, Aarhus University, Aarhus, Denmark
- Søren Dinesen Østergaard: Department of Clinical Medicine, Aarhus University, Aarhus, Denmark; Department of Affective Disorders, Aarhus University Hospital - Psychiatry, Aarhus, Denmark
7. Guo Z, Lai A, Thygesen JH, Farrington J, Keen T, Li K. Large Language Models for Mental Health Applications: Systematic Review. JMIR Ment Health 2024; 11:e57400. [PMID: 39423368] [PMCID: PMC11530718] [DOI: 10.2196/57400]
Abstract
BACKGROUND Large language models (LLMs) are advanced artificial neural networks trained on extensive datasets to accurately understand and generate natural language. While they have received much attention and demonstrated potential in digital health, their application in mental health, particularly in clinical settings, has generated considerable debate. OBJECTIVE This systematic review aims to critically assess the use of LLMs in mental health, specifically focusing on their applicability and efficacy in early screening, digital interventions, and clinical settings. By systematically collating and assessing the evidence from current studies, our work analyzes models, methodologies, data sources, and outcomes, thereby highlighting the potential of LLMs in mental health, the challenges they present, and the prospects for their clinical use. METHODS Adhering to the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines, this review searched 5 open-access databases: MEDLINE (accessed by PubMed), IEEE Xplore, Scopus, JMIR, and ACM Digital Library. Keywords used were (mental health OR mental illness OR mental disorder OR psychiatry) AND (large language models). This study included articles published between January 1, 2017, and April 30, 2024, and excluded articles published in languages other than English. RESULTS In total, 40 articles were evaluated, including 15 (38%) articles on mental health conditions and suicidal ideation detection through text analysis, 7 (18%) on the use of LLMs as mental health conversational agents, and 18 (45%) on other applications and evaluations of LLMs in mental health. LLMs show good effectiveness in detecting mental health issues and providing accessible, destigmatized eHealth services. However, assessments also indicate that the current risks associated with clinical use might surpass their benefits. These risks include inconsistencies in generated text; the production of hallucinations; and the absence of a comprehensive, benchmarked ethical framework. CONCLUSIONS This systematic review examines the clinical applications of LLMs in mental health, highlighting their potential and inherent risks. The study identifies several issues: the lack of multilingual datasets annotated by experts, concerns regarding the accuracy and reliability of generated content, challenges in interpretability due to the "black box" nature of LLMs, and ongoing ethical dilemmas. These ethical concerns include the absence of a clear, benchmarked ethical framework; data privacy issues; and the potential for overreliance on LLMs by both physicians and patients, which could compromise traditional medical practices. As a result, LLMs should not be considered substitutes for professional mental health services. However, the rapid development of LLMs underscores their potential as valuable clinical aids, emphasizing the need for continued research and development in this area. TRIAL REGISTRATION PROSPERO CRD42024508617; https://www.crd.york.ac.uk/prospero/display_record.php?RecordID=508617.
Affiliation(s)
- Zhijun Guo: Institute of Health Informatics, University College London, London, United Kingdom
- Alvina Lai: Institute of Health Informatics, University College London, London, United Kingdom
- Johan H Thygesen: Institute of Health Informatics, University College London, London, United Kingdom
- Joseph Farrington: Institute of Health Informatics, University College London, London, United Kingdom
- Thomas Keen: Institute of Health Informatics, University College London, London, United Kingdom; Great Ormond Street Institute of Child Health, University College London, London, United Kingdom
- Kezhi Li: Institute of Health Informatics, University College London, London, United Kingdom
8. Elyoseph Z, Gur T, Haber Y, Simon T, Angert T, Navon Y, Tal A, Asman O. An Ethical Perspective on the Democratization of Mental Health With Generative AI. JMIR Ment Health 2024; 11:e58011. [PMID: 39417792] [PMCID: PMC11500620] [DOI: 10.2196/58011]
Abstract
Knowledge has become more open and accessible to a large audience with the "democratization of information" facilitated by technology. This paper provides a sociohistorical perspective for the theme issue "Responsible Design, Integration, and Use of Generative AI in Mental Health." It evaluates ethical considerations in using generative artificial intelligence (GenAI) for the democratization of mental health knowledge and practice. It explores the historical context of democratizing information, transitioning from restricted access to widespread availability due to the internet, open-source movements, and most recently, GenAI technologies such as large language models. The paper highlights why GenAI technologies represent a new phase in the democratization movement, offering unparalleled access to highly advanced technology as well as information. In the realm of mental health, this requires delicate and nuanced ethical deliberation. Including GenAI in mental health may allow, among other things, improved accessibility to mental health care, personalized responses, and conceptual flexibility, and could facilitate a flattening of traditional hierarchies between health care providers and patients. At the same time, it also entails significant risks and challenges that must be carefully addressed. To navigate these complexities, the paper proposes a strategic questionnaire for assessing artificial intelligence-based mental health applications. This tool evaluates both the benefits and the risks, emphasizing the need for a balanced and ethical approach to GenAI integration in mental health. The paper calls for a cautious yet positive approach to GenAI in mental health, advocating for the active engagement of mental health professionals in guiding GenAI development. It emphasizes the importance of ensuring that GenAI advancements are not only technologically sound but also ethically grounded and patient-centered.
Affiliation(s)
- Zohar Elyoseph: Department of Brain Sciences, Faculty of Medicine, Imperial College, London, United Kingdom; Faculty of Education, University of Haifa, Haifa, Israel
- Tamar Gur: The Adelson School of Entrepreneurship, Reichman University, Herzliya, Israel
- Yuval Haber: The PhD Program of Hermeneutics & Cultural Studies, Bar-Ilan University, Ramat Gan, Israel
- Tomer Simon: Microsoft Israel R&D Center, Tel Aviv, Israel
- Tal Angert: Sagol School of Neuroscience, Tel Aviv University, Tel Aviv, Israel
- Yuval Navon: Sagol School of Neuroscience, Tel Aviv University, Tel Aviv, Israel
- Amir Tal: Samueli Initiative for Responsible AI in Medicine, Faculty of Medical and Health Sciences, Tel Aviv University, Tel Aviv, Israel
- Oren Asman: Samueli Initiative for Responsible AI in Medicine, Faculty of Medical and Health Sciences, Tel Aviv University, Tel Aviv, Israel; Department of Nursing, Faculty of Medical and Health Sciences, Tel Aviv University, Tel Aviv, Israel
9. Hadar-Shoval D, Asraf K, Shinan-Altman S, Elyoseph Z, Levkovich I. Embedded values-like shape ethical reasoning of large language models on primary care ethical dilemmas. Heliyon 2024; 10:e38056. [PMID: 39381244] [PMCID: PMC11458949] [DOI: 10.1016/j.heliyon.2024.e38056]
Abstract
Objective This article uses the framework of Schwartz's values theory to examine whether the embedded values-like profiles within large language models (LLMs) impact ethical decision-making in dilemmas faced in primary care. It specifically aims to evaluate whether each LLM exhibits a distinct values-like profile, assess its alignment with general population values, and determine whether latent values influence clinical recommendations. Methods The Portrait Values Questionnaire-Revised (PVQ-RR) was submitted to each LLM (Claude, Bard, GPT-3.5, and GPT-4) 20 times to ensure reliable and valid responses. Their responses were compared to a benchmark derived from an international sample of over 53,000 culturally diverse respondents who completed the PVQ-RR. Four vignettes depicting prototypical professional quandaries involving conflicts between competing values were presented to the LLMs. The option selected by each LLM and the strength of its recommendation were evaluated to determine whether the underlying values-like profiles impact output. Results Each LLM demonstrated a unique values-like profile. Universalism and self-direction were prioritized, while power and tradition were assigned less importance than population benchmarks, suggesting potential Western-centric biases. Four clinical vignettes involving value conflicts were presented to the LLMs. Preliminary indications suggested that the embedded values-like profiles influence recommendations. Significant differences in the strength of confidence in the chosen recommendations emerged between models, suggesting that further vetting is required before the LLMs can be relied on as judgment aids. However, the overall selection of preferences aligned with intrinsic value hierarchies. Conclusion The distinct intrinsic values-like profiles embedded within LLMs shape ethical decision-making, which carries implications for their integration in primary care settings serving diverse populations. For context-appropriate, equitable delivery of AI-assisted healthcare globally, it is essential that LLMs are tailored to align with cultural outlooks.
Affiliation(s)
- Dorit Hadar-Shoval: The Center for Psychobiological Research, Department of Psychology and Educational Counseling, Max Stern Yezreel Valley College, Israel
- Kfir Asraf: The Center for Psychobiological Research, Department of Psychology and Educational Counseling, Max Stern Yezreel Valley College, Israel
- Shiri Shinan-Altman: The Louis and Gabi Weisfeld School of Social Work, Bar-Ilan University, Ramat Gan, Israel
- Zohar Elyoseph: The Center for Psychobiological Research, Department of Psychology and Educational Counseling, Max Stern Yezreel Valley College, Israel; Department of Brain Sciences, Faculty of Medicine, Imperial College London, England; Department of Counseling and Human Development, Department of Education, University of Haifa, Israel
10. Tavory T. Regulating AI in Mental Health: Ethics of Care Perspective. JMIR Ment Health 2024; 11:e58493. [PMID: 39298759] [PMCID: PMC11450345] [DOI: 10.2196/58493]
Abstract
This article contends that the responsible artificial intelligence (AI) approach, which is the dominant ethics approach ruling most regulatory and ethical guidance, falls short because it overlooks the impact of AI on human relationships. Focusing only on responsible AI principles reinforces a narrow concept of accountability and responsibility of companies developing AI. This article proposes that applying the ethics of care approach to AI regulation can offer a more comprehensive regulatory and ethical framework that addresses AI's impact on human relationships. This dual approach is essential for the effective regulation of AI in the domain of mental health care. The article delves into the emergence of the new "therapeutic" area facilitated by AI-based bots, which operate without a therapist. The article highlights the difficulties involved, mainly the absence of a defined duty of care toward users, and shows how implementing ethics of care can establish clear responsibilities for developers. It also sheds light on the potential for emotional manipulation and the risks involved. In conclusion, the article proposes a series of considerations grounded in the ethics of care for the developmental process of AI-powered therapeutic tools.
Affiliation(s)
- Tamar Tavory: Faculty of Law, Bar Ilan University, Ramat Gan, Israel; The Samueli Initiative for Responsible AI in Medicine, Tel Aviv University, Tel Aviv, Israel
11. Omar M, Soffer S, Charney AW, Landi I, Nadkarni GN, Klang E. Applications of large language models in psychiatry: a systematic review. Front Psychiatry 2024; 15:1422807. [PMID: 38979501] [PMCID: PMC11228775] [DOI: 10.3389/fpsyt.2024.1422807]
Abstract
Background With their unmatched ability to interpret and engage with human language and context, large language models (LLMs) hint at the potential to bridge AI and human cognitive processes. This review explores the current application of LLMs, such as ChatGPT, in the field of psychiatry. Methods We followed PRISMA guidelines and searched PubMed, Embase, Web of Science, and Scopus up to March 2024. Results From 771 retrieved articles, we included 16 that directly examined LLMs' use in psychiatry. LLMs, particularly ChatGPT and GPT-4, showed diverse applications in clinical reasoning, social media, and education within psychiatry. They can assist in diagnosing mental health issues, managing depression, evaluating suicide risk, and supporting education in the field. However, our review also points out their limitations, such as difficulties with complex cases and potential underestimation of suicide risks. Conclusion Early research in psychiatry reveals LLMs' versatile applications, from diagnostic support to educational roles. Given the rapid pace of advancement, future investigations are poised to explore the extent to which these models might redefine traditional roles in mental health care.
Affiliation(s)
- Mahmud Omar: Faculty of Medicine, Tel-Aviv University, Tel-Aviv, Israel
- Shelly Soffer: Internal Medicine B, Assuta Medical Center, Ashdod, Israel; Ben-Gurion University of the Negev, Be'er Sheva, Israel
- Isotta Landi: Icahn School of Medicine at Mount Sinai, New York, NY, United States
- Girish N Nadkarni: Hasso Plattner Institute for Digital Health at Mount Sinai, Icahn School of Medicine at Mount Sinai, New York, NY, United States
- Eyal Klang: Hasso Plattner Institute for Digital Health at Mount Sinai, Icahn School of Medicine at Mount Sinai, New York, NY, United States
12. Shinan-Altman S, Elyoseph Z, Levkovich I. The impact of history of depression and access to weapons on suicide risk assessment: a comparison of ChatGPT-3.5 and ChatGPT-4. PeerJ 2024; 12:e17468. [PMID: 38827287] [PMCID: PMC11143969] [DOI: 10.7717/peerj.17468]
Abstract
The aim of this study was to evaluate the effectiveness of ChatGPT-3.5 and ChatGPT-4 in incorporating critical risk factors, namely history of depression and access to weapons, into suicide risk assessments. Both models assessed suicide risk using scenarios that featured individuals with and without a history of depression and access to weapons. The models estimated the likelihood of suicidal thoughts, suicide attempts, serious suicide attempts, and suicide-related mortality on a Likert scale. A multivariate three-way ANOVA with Bonferroni post hoc tests was conducted to examine the impact of the aforementioned independent factors (history of depression and access to weapons) on these outcome variables. Both models identified history of depression as a significant suicide risk factor. ChatGPT-4 demonstrated a more nuanced understanding of the relationship between depression, access to weapons, and suicide risk. In contrast, ChatGPT-3.5 displayed limited insight into this complex relationship. ChatGPT-4 consistently assigned higher severity ratings to suicide-related variables than did ChatGPT-3.5. The study highlights the potential of these two models, particularly ChatGPT-4, to enhance suicide risk assessment by considering complex risk factors.
Affiliation(s)
- Zohar Elyoseph: Department of Brain Sciences, Faculty of Medicine, Imperial College London, London, England, United Kingdom; The Center for Psychobiological Research, Department of Psychology and Educational Counseling, Max Stern Yezreel Valley College, Emek Yezreel, Israel
- Inbar Levkovich: Faculty of Graduate Studies, Oranim Academic College of Education, Kiryat Tiv'on, Israel
13. Haber Y, Levkovich I, Hadar-Shoval D, Elyoseph Z. The Artificial Third: A Broad View of the Effects of Introducing Generative Artificial Intelligence on Psychotherapy. JMIR Ment Health 2024; 11:e54781. [PMID: 38787297] [PMCID: PMC11137430] [DOI: 10.2196/54781]
Abstract
This paper explores a significant shift in the field of mental health in general and psychotherapy in particular following generative artificial intelligence's new capabilities in processing and generating humanlike language. Following Freud, this lingo-technological development is conceptualized as the "fourth narcissistic blow" that science inflicts on humanity. We argue that this narcissistic blow has a potentially dramatic influence on perceptions of human society, interrelationships, and the self. We should, accordingly, expect dramatic changes in perceptions of the therapeutic act following the emergence of what we term the artificial third in the field of psychotherapy. The introduction of an artificial third marks a critical juncture, prompting us to ask the following important core questions that address two basic elements of critical thinking, namely, transparency and autonomy: (1) What is this new artificial presence in therapy relationships? (2) How does it reshape our perception of ourselves and our interpersonal dynamics? and (3) What remains of the irreplaceable human elements at the core of therapy? Given the ethical implications that arise from these questions, this paper proposes that the artificial third can be a valuable asset when applied with insight and ethical consideration, enhancing but not replacing the human touch in therapy.
Affiliation(s)
- Yuval Haber: The PhD Program of Hermeneutics and Cultural Studies, Interdisciplinary Studies Unit, Bar-Ilan University, Ramat Gan, Israel
- Dorit Hadar-Shoval: Department of Psychology and Educational Counseling, The Max Stern Yezreel Valley College, Emek Yezreel, Israel
- Zohar Elyoseph: Department of Brain Sciences, Faculty of Medicine, Imperial College London, London, United Kingdom; The Center for Psychobiological Research, Department of Psychology and Educational Counseling, The Max Stern Yezreel Valley College, Emek Yezreel, Israel
14. Hadar-Shoval D, Asraf K, Mizrachi Y, Haber Y, Elyoseph Z. Assessing the Alignment of Large Language Models With Human Values for Mental Health Integration: Cross-Sectional Study Using Schwartz's Theory of Basic Values. JMIR Ment Health 2024; 11:e55988. [PMID: 38593424] [PMCID: PMC11040439] [DOI: 10.2196/55988]
Abstract
BACKGROUND Large language models (LLMs) hold potential for mental health applications. However, their opaque alignment processes may embed biases that shape problematic perspectives. Evaluating the values embedded within LLMs that guide their decision-making has ethical importance. Schwartz's theory of basic values (STBV) provides a framework for quantifying cultural value orientations and has shown utility for examining values in mental health contexts, including cultural, diagnostic, and therapist-client dynamics. OBJECTIVE This study aimed to (1) evaluate whether the STBV can measure value-like constructs within leading LLMs and (2) determine whether LLMs exhibit value-like patterns distinct from those of humans and from each other. METHODS In total, 4 LLMs (Bard, Claude 2, Generative Pretrained Transformer [GPT]-3.5, and GPT-4) were anthropomorphized and instructed to complete the Portrait Values Questionnaire-Revised (PVQ-RR) to assess value-like constructs. Their responses over 10 trials were analyzed for reliability and validity. To benchmark the LLMs' value profiles, their results were compared to published data from a diverse sample of 53,472 individuals across 49 nations who had completed the PVQ-RR. This allowed us to assess whether the LLMs diverged from established human value patterns across cultural groups. Value profiles were also compared between models via statistical tests. RESULTS The PVQ-RR showed good reliability and validity for quantifying value-like infrastructure within the LLMs. However, substantial divergence emerged between the LLMs' value profiles and population data. The models lacked consensus and exhibited distinct motivational biases, reflecting opaque alignment processes. For example, all models prioritized universalism and self-direction, while de-emphasizing achievement, power, and security relative to humans. Successful discriminant analysis differentiated the 4 LLMs' distinct value profiles. Further examination found the biased value profiles strongly predicted the LLMs' responses when presented with mental health dilemmas requiring choosing between opposing values. This provided further validation for the models embedding distinct motivational value-like constructs that shape their decision-making. CONCLUSIONS This study leveraged the STBV to map the motivational value-like infrastructure underpinning leading LLMs. Although the study demonstrated the STBV can effectively characterize value-like infrastructure within LLMs, substantial divergence from human values raises ethical concerns about aligning these models with mental health applications. The biases toward certain cultural value sets pose risks if integrated without proper safeguards. For example, prioritizing universalism could promote unconditional acceptance even when clinically unwise. Furthermore, the differences between the LLMs underscore the need to standardize alignment processes to capture true cultural diversity. Thus, any responsible integration of LLMs into mental health care must account for their embedded biases and motivation mismatches to ensure equitable delivery across diverse populations. Achieving this will require transparency and refinement of alignment techniques to instill comprehensive human values.
Affiliation(s)
- Dorit Hadar-Shoval: The Psychology Department, Max Stern Yezreel Valley College, Tel Adashim, Israel
- Kfir Asraf: The Psychology Department, Max Stern Yezreel Valley College, Tel Adashim, Israel
- Yonathan Mizrachi: The Jane Goodall Institute, Max Stern Yezreel Valley College, Tel Adashim, Israel; The Laboratory for AI, Machine Learning, Business & Data Analytics, Tel-Aviv University, Tel Aviv, Israel
- Yuval Haber: The PhD Program of Hermeneutics and Cultural Studies, Interdisciplinary Studies Unit, Bar-Ilan University, Ramat Gan, Israel
- Zohar Elyoseph: The Psychology Department, Center for Psychobiological Research, Max Stern Yezreel Valley College, Tel Adashim, Israel; Department of Brain Sciences, Faculty of Medicine, Imperial College London, London, United Kingdom
15. Elyoseph Z, Levkovich I. Comparing the Perspectives of Generative AI, Mental Health Experts, and the General Public on Schizophrenia Recovery: Case Vignette Study. JMIR Ment Health 2024; 11:e53043. [PMID: 38533615] [PMCID: PMC11004608] [DOI: 10.2196/53043]
Abstract
Background The current paradigm in mental health care focuses on clinical recovery and symptom remission. This model's efficacy is influenced by therapist trust in patient recovery potential and the depth of the therapeutic relationship. Schizophrenia is a chronic illness with severe symptoms where the possibility of recovery is a matter of debate. As artificial intelligence (AI) becomes integrated into the health care field, it is important to examine its ability to assess recovery potential in major psychiatric disorders such as schizophrenia. Objective This study aimed to evaluate the ability of large language models (LLMs) in comparison to mental health professionals to assess the prognosis of schizophrenia with and without professional treatment and the long-term positive and negative outcomes. Methods Vignettes were input into the LLM interfaces and assessed 10 times by each of 4 AI platforms: ChatGPT-3.5, ChatGPT-4, Google Bard, and Claude. A total of 80 evaluations were collected and benchmarked against existing norms to analyze what mental health professionals (general practitioners, psychiatrists, clinical psychologists, and mental health nurses) and the general public think about schizophrenia prognosis with and without professional treatment and the positive and negative long-term outcomes of schizophrenia interventions. Results For the prognosis of schizophrenia with professional treatment, ChatGPT-3.5 was notably pessimistic, whereas ChatGPT-4, Claude, and Bard aligned with professional views but differed from the general public. All LLMs predicted that schizophrenia would remain static or worsen without professional treatment. For long-term outcomes, ChatGPT-4 and Claude predicted more negative outcomes than Bard and ChatGPT-3.5. For positive outcomes, ChatGPT-3.5 and Claude were more pessimistic than Bard and ChatGPT-4. Conclusions The finding that 3 out of the 4 LLMs aligned closely with the predictions of mental health professionals when considering the "with treatment" condition demonstrates the potential of this technology to provide professional clinical prognoses. The pessimistic assessment of ChatGPT-3.5 is a disturbing finding since it may reduce the motivation of patients to start or persist with treatment for schizophrenia. Overall, although LLMs hold promise in augmenting health care, their application necessitates rigorous validation and a harmonious blend with human expertise.
Affiliation(s)
- Zohar Elyoseph: Department of Brain Sciences, Faculty of Medicine, Imperial College London, London, United Kingdom; The Center for Psychobiological Research, Department of Psychology and Educational Counseling, Max Stern Yezreel Valley College, Emek Yezreel, Israel
- Inbar Levkovich: Faculty of Graduate Studies, Oranim Academic College, Kiryat Tiv'on, Israel
16. Elyoseph Z, Refoua E, Asraf K, Lvovsky M, Shimoni Y, Hadar-Shoval D. Capacity of Generative AI to Interpret Human Emotions From Visual and Textual Data: Pilot Evaluation Study. JMIR Ment Health 2024; 11:e54369. [PMID: 38319707] [PMCID: PMC10879976] [DOI: 10.2196/54369]
Abstract
BACKGROUND Mentalization, which is integral to human cognitive processes, pertains to the interpretation of one's own and others' mental states, including emotions, beliefs, and intentions. With the advent of artificial intelligence (AI) and the prominence of large language models in mental health applications, questions persist about their aptitude in emotional comprehension. The prior iteration of the large language model from OpenAI, ChatGPT-3.5, demonstrated an advanced capacity to interpret emotions from textual data, surpassing human benchmarks. Given the introduction of ChatGPT-4, with its enhanced visual processing capabilities, and considering Google Bard's existing visual functionalities, a rigorous assessment of their proficiency in visual mentalizing is warranted. OBJECTIVE The aim of the research was to critically evaluate the capabilities of ChatGPT-4 and Google Bard with regard to their competence in discerning visual mentalizing indicators as contrasted with their textual-based mentalizing abilities. METHODS The Reading the Mind in the Eyes Test developed by Baron-Cohen and colleagues was used to assess the models' proficiency in interpreting visual emotional indicators. Simultaneously, the Levels of Emotional Awareness Scale was used to evaluate the large language models' aptitude in textual mentalizing. Collating data from both tests provided a holistic view of the mentalizing capabilities of ChatGPT-4 and Bard. RESULTS ChatGPT-4, displaying a pronounced ability in emotion recognition, secured scores of 26 and 27 in 2 distinct evaluations, significantly deviating from a random response paradigm (P<.001). These scores align with established benchmarks from the broader human demographic. Notably, ChatGPT-4 exhibited consistent responses, with no discernible biases pertaining to the sex of the model or the nature of the emotion. In contrast, Google Bard's performance aligned with random response patterns, securing scores of 10 and 12 and rendering further detailed analysis redundant. In the domain of textual analysis, both ChatGPT and Bard surpassed established benchmarks from the general population, with their performances being remarkably congruent. CONCLUSIONS ChatGPT-4 proved its efficacy in the domain of visual mentalizing, aligning closely with human performance standards. Although both models displayed commendable acumen in textual emotion interpretation, Bard's capabilities in visual emotion interpretation necessitate further scrutiny and potential refinement. This study stresses the criticality of ethical AI development for emotional recognition, highlighting the need for inclusive data, collaboration with patients and mental health experts, and stringent governmental oversight to ensure transparency and protect patient privacy.
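The abstract's claim that scores of 26 and 27 deviate significantly from random responding can be checked with a simple binomial test, assuming the standard 36-item Reading the Mind in the Eyes Test with 4 response options per item (chance accuracy of 25%). The sketch below is illustrative only; the item count and the choice of test are my assumptions, not the authors' reported analysis.

```python
from scipy.stats import binomtest

# Assumed RMET format: 36 items, 4 options each -> chance accuracy = 0.25.
N_ITEMS, CHANCE = 36, 0.25

# Scores reported in the abstract for ChatGPT-4 across two evaluations.
for score in (26, 27):
    result = binomtest(score, N_ITEMS, CHANCE, alternative="greater")
    print(f"{score}/{N_ITEMS} correct vs. chance: P = {result.pvalue:.2e}")
```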
Affiliation(s)
- Zohar Elyoseph: Department of Educational Psychology, The Center for Psychobiological Research, The Max Stern Yezreel Valley College, Emek Yezreel, Israel; Imperial College London, London, United Kingdom
- Elad Refoua: Department of Psychology, Bar-Ilan University, Ramat Gan, Israel
- Kfir Asraf: Department of Psychology, The Max Stern Yezreel Valley College, Emek Yezreel, Israel
- Maya Lvovsky: Department of Psychology, The Max Stern Yezreel Valley College, Emek Yezreel, Israel
- Yoav Shimoni: Boston Children's Hospital, Boston, MA, United States
- Dorit Hadar-Shoval: Department of Psychology, The Max Stern Yezreel Valley College, Emek Yezreel, Israel
17. Elyoseph Z, Levkovich I, Shinan-Altman S. Assessing prognosis in depression: comparing perspectives of AI models, mental health professionals and the general public. Fam Med Community Health 2024; 12:e002583. [PMID: 38199604] [PMCID: PMC10806564] [DOI: 10.1136/fmch-2023-002583]
Abstract
BACKGROUND Artificial intelligence (AI) has rapidly permeated various sectors, including healthcare, highlighting its potential to facilitate mental health assessments. This study explores the underexplored domain of AI's role in evaluating prognosis and long-term outcomes in depressive disorders, offering insights into how AI large language models (LLMs) compare with human perspectives. METHODS Using case vignettes, we conducted a comparative analysis involving different LLMs (ChatGPT-3.5, ChatGPT-4, Claude and Bard), mental health professionals (general practitioners, psychiatrists, clinical psychologists and mental health nurses), and the general public, as reported previously. We evaluated the LLMs' ability to generate a prognosis, anticipated outcomes with and without professional intervention, and envisioned long-term positive and negative consequences for individuals with depression. RESULTS In most of the examined cases, the four LLMs consistently identified depression as the primary diagnosis and recommended a combined treatment of psychotherapy and antidepressant medication. ChatGPT-3.5 exhibited a significantly pessimistic prognosis distinct from other LLMs, professionals and the public. ChatGPT-4, Claude and Bard aligned closely with the perspectives of mental health professionals and the general public, all of whom anticipated no improvement or worsening without professional help. Regarding long-term outcomes, ChatGPT-3.5, Claude and Bard consistently projected significantly fewer negative long-term consequences of treatment than ChatGPT-4. CONCLUSIONS This study underscores the potential of AI to complement the expertise of mental health professionals and promote a collaborative paradigm in mental healthcare. The observation that three of the four LLMs closely mirrored the anticipations of mental health experts in scenarios involving treatment underscores the technology's prospective value in offering professional clinical forecasts. The pessimistic outlook presented by ChatGPT-3.5 is concerning, as it could potentially diminish patients' drive to initiate or continue depression therapy. In summary, although LLMs show potential in enhancing healthcare services, their utilisation requires thorough verification and a seamless integration with human judgement and skills.
Affiliation(s)
- Zohar Elyoseph: Department of Psychology and Educational Counseling, The Center for Psychobiological Research, Max Stern Yezreel Valley College, Yezreel Valley, Israel; Department of Brain Sciences, Imperial College London, London, UK
- Inbar Levkovich: Faculty of Graduate Studies, Oranim Academic College, Tivon, Israel
- Shiri Shinan-Altman: The Louis and Gabi Weisfeld School of Social Work, Bar-Ilan University, Ramat Gan, Israel
18. Levkovich I, Rabin E, Brann M, Elyoseph Z. Large language models outperform general practitioners in identifying complex cases of childhood anxiety. Digit Health 2024; 10:20552076241294182. [PMID: 39687523] [PMCID: PMC11648044] [DOI: 10.1177/20552076241294182]
Abstract
Objective Anxiety is prevalent in childhood but often remains undiagnosed due to its physical manifestations and significant comorbidity. Despite the availability of effective treatments, including medication and psychotherapy, research indicates that physicians struggle to identify childhood anxiety, particularly in complex and challenging cases. This study aims to explore the potential effectiveness of artificial intelligence (AI) language models in diagnosing childhood anxiety compared to general practitioners (GPs). Methods During February 2024, we evaluated the ability of several large language models (LLMs: ChatGPT-3.5, ChatGPT-4, Claude.AI, and Gemini) to identify cases of childhood anxiety disorder, compared with the reports of GPs. Results AI tools exhibited significantly higher rates of identifying anxiety than GPs. Each AI tool accurately identified anxiety in at least one case: Claude.AI and Gemini identified at least four cases, ChatGPT-3.5 identified three cases, and ChatGPT-4 identified one or two cases. Additionally, 40% of GPs preferred to manage the cases within their practice, often with the help of a practice nurse, whereas AI tools generally recommended referral to specialized mental or somatic health services. Conclusion Preliminary findings indicate that LLMs, specifically Claude.AI and Gemini, exhibit notable diagnostic capabilities in identifying child anxiety, demonstrating a comparative advantage over GPs.
Affiliation(s)
- Inbar Levkovich: The Faculty of Education, Tel Hai College, Upper Galilee, Israel
- Eyal Rabin: Department of Psychology and Education, The Open University of Israel, Ra'anana, Israel
- Michal Brann: Department of Psychology and Educational Counseling, Max Stern Yezreel Valley College, Afula, Israel
- Zohar Elyoseph: Faculty of Education, University of Haifa, Haifa, Israel
19. Elyoseph Z, Hadar Shoval D, Levkovich I. Beyond Personhood: Ethical Paradigms in the Generative Artificial Intelligence Era. Am J Bioeth 2024; 24:57-59. [PMID: 38236857] [DOI: 10.1080/15265161.2023.2278546]
Affiliation(s)
- Zohar Elyoseph: The Max Stern Yezreel Valley College; Imperial College, London
20. Levkovich I, Elyoseph Z. Identifying depression and its determinants upon initiating treatment: ChatGPT versus primary care physicians. Fam Med Community Health 2023; 11:e002391. [PMID: 37844967] [PMCID: PMC10582915] [DOI: 10.1136/fmch-2023-002391]
Abstract
OBJECTIVE To compare evaluations of depressive episodes and suggested treatment protocols generated by Chat Generative Pretrained Transformer (ChatGPT)-3 and ChatGPT-4 with the recommendations of primary care physicians. METHODS Vignettes were input to the ChatGPT interface. These vignettes focused primarily on hypothetical patients with symptoms of depression during initial consultations. The creators of these vignettes meticulously designed eight distinct versions that systematically varied patient attributes: sex, socioeconomic status (blue-collar or white-collar worker), and depression severity (mild or severe). Each variant was subsequently introduced into ChatGPT-3.5 and ChatGPT-4. Each vignette was repeated 10 times to ensure consistency and reliability of the ChatGPT responses. RESULTS For mild depression, ChatGPT-3.5 and ChatGPT-4 recommended psychotherapy in 95.0% and 97.5% of cases, respectively. Primary care physicians, however, recommended psychotherapy in only 4.3% of cases. For severe cases, ChatGPT favoured a combined treatment approach, as did primary care physicians. The pharmacological recommendations of ChatGPT-3.5 and ChatGPT-4 showed a preference for exclusive use of antidepressants (74% and 68%, respectively), in contrast with primary care physicians, who typically recommended a mix of antidepressants and anxiolytics/hypnotics (67.4%). Unlike primary care physicians, ChatGPT showed no gender or socioeconomic biases in its recommendations. CONCLUSION ChatGPT-3.5 and ChatGPT-4 aligned well with accepted guidelines for managing mild and severe depression, without showing the gender or socioeconomic biases observed among primary care physicians. Despite the suggested potential benefit of using artificial intelligence (AI) chatbots like ChatGPT to enhance clinical decision making, further research is needed to refine AI recommendations for severe cases and to consider potential risks and ethical issues.
Affiliation(s)
- Zohar Elyoseph: Department of Psychology and Educational Counseling, Max Stern Academic College of Emek Yezreel, Emek Yezreel, Israel; Department of Brain Sciences, Imperial College London, London, UK