1
Das M, Ghosh A, Sunoj RB. Advances in machine learning with chemical language models in molecular property and reaction outcome predictions. J Comput Chem 2024; 45:1160-1176. [PMID: 38299229 DOI: 10.1002/jcc.27315] [Received: 11/22/2023] [Revised: 01/06/2024] [Accepted: 01/09/2024]
Abstract
Molecular properties and reactions form the foundation of chemical space. Over the years, innumerable molecules have been synthesized; a smaller fraction of them found immediate applications, while a larger proportion served as testimony to the creative and empirical nature of chemical science. With increasing emphasis on sustainable practices, it is desirable that a target set of molecules be synthesized through fewer empirical attempts, rather than a larger library, to realize an active candidate. On this front, predictive efforts using machine learning (ML) models built on available data acquire timely significance. Prediction of molecular properties and reaction outcomes remains one of the burgeoning applications of ML in chemical science. Among the several methods of encoding molecular samples for ML models, those that employ language-like representations are gaining steady popularity. Such representations additionally allow well-developed natural language processing (NLP) models to be adopted for chemical applications. Against this advantageous background, we describe herein several successful chemical applications of NLP, focusing on molecular property and reaction outcome predictions. From relatively simple recurrent neural networks (RNNs) to complex models such as transformers, different network architectures have been leveraged for tasks such as de novo drug design, catalyst generation, and forward- and retro-synthesis predictions. Chemical language models (CLMs) provide promising avenues toward a broad range of applications in a time- and cost-effective manner. While we showcase an optimistic outlook for CLMs, attention is also placed on the persisting challenges in the reaction domain, which will hopefully be addressed by advanced algorithms tailored to chemical language and by the increased availability of high-quality datasets.
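The language-like molecular representations this review surveys are typically SMILES strings split into chemistry-aware tokens before being fed to an RNN or transformer. As a minimal illustration (not code from the paper), a regex-based SMILES tokenizer in the style popularized for reaction transformers; the exact pattern is an illustrative choice:

```python
import re

# Regex-based SMILES tokenizer: bracket atoms, two-letter halogens, aromatic
# atoms, bonds, branches, and ring-closure digits each become one token.
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|"
    r"\(|\)|\.|=|#|-|\+|\\|\/|:|~|@|\?|>|\*|\$|\%[0-9]{2}|[0-9])"
)

def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into chemistry-aware tokens for a language model."""
    tokens = SMILES_TOKEN_PATTERN.findall(smiles)
    # Sanity check: the tokens must reassemble into the original string.
    assert "".join(tokens) == smiles, "unrecognized characters in SMILES"
    return tokens

# Aspirin: 'Cl'/'Br' are single tokens, ring-closure digits stay separate.
print(tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O"))
```

The resulting token sequence plays the same role words do in NLP, which is what lets off-the-shelf sequence models be reused for chemistry.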
Affiliation(s)
- Manajit Das
- Department of Chemistry, Indian Institute of Technology Bombay, Mumbai, India
- Ankit Ghosh
- Department of Chemistry, Indian Institute of Technology Bombay, Mumbai, India
- Raghavan B Sunoj
- Department of Chemistry, Indian Institute of Technology Bombay, Mumbai, India
- Centre for Machine Intelligence and Data Science, Indian Institute of Technology Bombay, Mumbai, India
2
Mishra V, Sarraju A, Kalwani NM, Dexter JP. Evaluation of Prompts to Simplify Cardiovascular Disease Information Generated Using a Large Language Model: Cross-Sectional Study. J Med Internet Res 2024; 26:e55388. [PMID: 38648104 DOI: 10.2196/55388] [Received: 12/11/2023] [Revised: 01/25/2024] [Accepted: 01/31/2024]
Abstract
In this cross-sectional study, we evaluated the completeness, readability, and syntactic complexity of cardiovascular disease prevention information produced by GPT-4 in response to 4 kinds of prompts.
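Readability analyses like the one described typically rely on standard formulas such as the Flesch-Kincaid grade level. As a rough sketch (not the validated tooling a study like this would use), a grade-level calculator with a naive vowel-group syllable counter:

```python
import re

def count_syllables(word: str) -> int:
    # Naive heuristic: count groups of consecutive vowels. Validated
    # readability tools use dictionaries or better heuristics.
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_kincaid_grade(text: str) -> float:
    """Approximate Flesch-Kincaid grade level of a text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    # Standard Flesch-Kincaid grade formula.
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)

# Simpler phrasing should score a lower grade level.
print(flesch_kincaid_grade("The cat sat on the mat."))
print(flesch_kincaid_grade(
    "Cardiovascular complications necessitate comprehensive multidisciplinary management."))
```

Comparing such scores across prompt variants is one way to quantify whether a "simplify this" prompt actually lowers the reading level of generated patient information.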
Affiliation(s)
- Vishala Mishra
- Department of Biostatistics and Bioinformatics, Duke University School of Medicine, Durham, NC, United States
- Ashish Sarraju
- Department of Cardiovascular Medicine, Cleveland Clinic, Cleveland, OH, United States
- Neil M Kalwani
- Veterans Affairs Palo Alto Health Care System, Palo Alto, CA, United States
- Division of Cardiovascular Medicine and the Cardiovascular Institute, Department of Medicine, Stanford University School of Medicine, Stanford, CA, United States
- Joseph P Dexter
- Data Science Initiative, Harvard University, Allston, MA, United States
- Department of Human Evolutionary Biology, Harvard University, Cambridge, MA, United States
- Institute of Collaborative Innovation, University of Macau, Taipa, Macao
3
Buehler MJ. Generative Retrieval-Augmented Ontologic Graph and Multiagent Strategies for Interpretive Large Language Model-Based Materials Design. ACS Eng Au 2024; 4:241-277. [PMID: 38646516 PMCID: PMC11027160 DOI: 10.1021/acsengineeringau.3c00058] [Received: 09/24/2023] [Revised: 12/06/2023] [Accepted: 12/07/2023]
Abstract
Transformer neural networks show promising capabilities, in particular for uses in materials analysis, design, and manufacturing, including their capacity to work effectively with human language, symbols, code, and numerical data. Here, we explore the use of large language models (LLMs) as a tool that can support engineering analysis of materials: retrieving key information about subject areas, developing research hypotheses, discovering mechanistic relationships across disparate areas of knowledge, and writing and executing simulation codes for active knowledge generation based on physical ground truths. Moreover, when used as sets of AI agents with specific features, capabilities, and instructions, LLMs can provide powerful problem-solution strategies for analysis and design problems. Our experiments focus on a fine-tuned model, MechGPT, developed from training data in the mechanics-of-materials domain. We first affirm that fine-tuning endows LLMs with a reasonable understanding of subject-area knowledge. However, when queried outside the context of learned material, LLMs can have difficulty recalling correct information and may hallucinate. We show how this can be addressed using retrieval-augmented Ontological Knowledge Graph strategies. The graph-based strategy helps us discern not only which concepts the model considers important but also how they are related, which significantly improves generative performance and naturally allows new and augmented data sources to be injected into generative AI algorithms. We find that the additional feature of relatedness provides advantages over regular retrieval-augmentation approaches: it not only improves LLM performance but also provides mechanistic insights for exploring a materials design process.
Illustrated for a use case relating distinct areas of knowledge (here, music and proteins), such strategies can also provide an interpretable graph structure with rich information at the node, edge, and subgraph levels that offers specific insights into mechanisms and relationships. We discuss other approaches to improving generative quality, including nonlinear sampling strategies and agent-based modeling, which offer enhancements over single-shot generation; in these, LLMs are used both to generate content and to assess content against an objective target. Examples provided include complex question answering, code generation, and code execution in the context of automated force-field development from actively learned density functional theory (DFT) modeling and data analysis.
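The core of a graph-based retrieval-augmentation strategy like the one described is retrieving a relevant subgraph and serializing it into the prompt. A minimal sketch of that idea follows; the graph, relation names, and prompt format are invented for illustration, whereas the paper's actual pipeline builds its ontological graph from an LLM over a materials-science corpus:

```python
from collections import defaultdict

class OntologicalGraph:
    """Tiny triple store: nodes are concepts, edges carry relation labels."""

    def __init__(self):
        self.edges = defaultdict(list)  # node -> [(relation, neighbor)]

    def add(self, head: str, relation: str, tail: str):
        self.edges[head].append((relation, tail))
        self.edges[tail].append((f"inverse of {relation}", head))

    def retrieve_context(self, query_node: str) -> str:
        """Serialize the 1-hop neighborhood of a concept into prompt text."""
        triples = [f"{query_node} --{rel}--> {nbr}"
                   for rel, nbr in self.edges.get(query_node, [])]
        return "\n".join(triples)

kg = OntologicalGraph()
kg.add("spider silk", "exhibits", "high toughness")
kg.add("beta-sheet crystal", "strengthens", "spider silk")

# The retrieved subgraph would be prepended to the user question before
# querying the LLM, grounding generation in explicit relations.
context = kg.retrieve_context("spider silk")
print(context)
```

Because the retrieved context is a set of explicit labeled edges rather than raw text chunks, the model's answer can be traced back to specific relations, which is the interpretability benefit the abstract emphasizes.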
Affiliation(s)
- Markus J. Buehler
- Laboratory for Atomistic and Molecular Mechanics (LAMM), Massachusetts Institute of Technology, 77 Massachusetts Ave., Cambridge, Massachusetts 02139, United States
- Department of Civil and Environmental Engineering, Massachusetts Institute of Technology, 77 Massachusetts Ave., Cambridge, Massachusetts 02139, United States
- Department of Mechanical Engineering, Massachusetts Institute of Technology, 77 Massachusetts Ave., Cambridge, Massachusetts 02139, United States
- Center for Computational Science and Engineering, Schwarzman College of Computing, Massachusetts Institute of Technology, 77 Massachusetts Ave., Cambridge, Massachusetts 02139, United States
4
Hirosawa T, Harada Y, Tokumasu K, Ito T, Suzuki T, Shimizu T. Evaluating ChatGPT-4's Diagnostic Accuracy: Impact of Visual Data Integration. JMIR Med Inform 2024; 12:e55627. [PMID: 38592758 DOI: 10.2196/55627] [Received: 12/18/2023] [Revised: 02/14/2024] [Accepted: 03/13/2024]
Abstract
BACKGROUND In the evolving field of health care, multimodal generative artificial intelligence (AI) systems, such as ChatGPT-4 with vision (ChatGPT-4V), represent a significant advancement, as they integrate visual data with text data. This integration has the potential to revolutionize clinical diagnostics by offering more comprehensive analysis capabilities. However, the impact on diagnostic accuracy of using image data to augment ChatGPT-4 remains unclear. OBJECTIVE This study aims to assess the impact of adding image data on ChatGPT-4's diagnostic accuracy and provide insights into how image data integration can enhance the accuracy of multimodal AI in medical diagnostics. Specifically, this study endeavored to compare the diagnostic accuracy between ChatGPT-4V, which processed both text and image data, and its counterpart, ChatGPT-4, which only uses text data. METHODS We identified a total of 557 case reports published in the American Journal of Case Reports from January 2022 to March 2023. After excluding cases that were nondiagnostic, pediatric, and lacking image data, we included 363 case descriptions with their final diagnoses and associated images. We compared the diagnostic accuracy of ChatGPT-4V and ChatGPT-4 without vision based on their ability to include the final diagnoses within differential diagnosis lists. Two independent physicians evaluated their accuracy, with a third resolving any discrepancies, ensuring a rigorous and objective analysis. RESULTS The integration of image data into ChatGPT-4V did not significantly enhance diagnostic accuracy, showing that final diagnoses were included in the top 10 differential diagnosis lists at a rate of 85.1% (n=309), comparable to the rate of 87.9% (n=319) for the text-only version (P=.33). Notably, ChatGPT-4V's performance in correctly identifying the top diagnosis was inferior, at 44.4% (n=161), compared with 55.9% (n=203) for the text-only version (P=.002, χ2 test). 
Additionally, ChatGPT-4V's self-reports indicated that image data accounted for 30% of the weight in developing the differential diagnosis lists in more than half of cases. CONCLUSIONS Our findings reveal that, currently, ChatGPT-4V predominantly relies on textual data, limiting its ability to fully use the diagnostic potential of visual information. This study underscores the need for further development of multimodal generative AI systems to effectively integrate and use clinical image data. Enhancing the diagnostic performance of such AI systems through improved multimodal data integration could significantly benefit patient care by providing more accurate and comprehensive diagnostic insights. Future research should focus on overcoming these limitations, paving the way for the practical application of advanced AI in medicine.
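The study's headline metrics reduce to two computations: top-n inclusion accuracy over differential diagnosis lists, and a 2x2 chi-square comparison between model variants. A sketch with invented toy data (the study itself scored 363 curated case reports with independent physician raters):

```python
def top_n_accuracy(differentials: list[list[str]], finals: list[str], n: int = 10) -> float:
    """Fraction of cases whose final diagnosis appears in the top-n differential list."""
    hits = sum(final in diffs[:n] for diffs, final in zip(differentials, finals))
    return hits / len(finals)

def chi_square_2x2(hits_a: int, total_a: int, hits_b: int, total_b: int) -> float:
    """Pearson chi-square statistic (no continuity correction) for two proportions."""
    a, b = hits_a, total_a - hits_a
    c, d = hits_b, total_b - hits_b
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Toy data: 2 of 3 cases have the final diagnosis in the differential list.
diffs = [["sepsis", "pneumonia"], ["gout", "cellulitis"], ["migraine"]]
finals = ["pneumonia", "pseudogout", "migraine"]
print(top_n_accuracy(diffs, finals, n=10))
```

The statistic would then be compared against the chi-square distribution with 1 degree of freedom to obtain a P value, as in the reported P=.002 comparison of top-diagnosis rates.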
Affiliation(s)
- Takanobu Hirosawa
- Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Shimotsuga, Japan
- Yukinori Harada
- Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Shimotsuga, Japan
- Kazuki Tokumasu
- Department of General Medicine, Okayama University Graduate School of Medicine, Dentistry and Pharmaceutical Sciences, Okayama, Japan
- Tomoharu Suzuki
- Department of Hospital Medicine, Urasoe General Hospital, Okinawa, Japan
- Taro Shimizu
- Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Shimotsuga, Japan
5
Xu X, Li C, Yuan X, Zhang Q, Liu Y, Zhu Y, Chen T. ACP-DRL: an anticancer peptides recognition method based on deep representation learning. Front Genet 2024; 15:1376486. [PMID: 38655048 PMCID: PMC11035771 DOI: 10.3389/fgene.2024.1376486] [Received: 01/25/2024] [Accepted: 03/25/2024]
Abstract
Cancer, a significant global public health issue, resulted in about 10 million deaths in 2022. Anticancer peptides (ACPs), a category of bioactive peptides, have emerged as a focal point in clinical cancer research owing to their potential to inhibit tumor cell proliferation with minimal side effects. However, recognizing ACPs through wet-lab experiments still suffers from low efficiency and high cost. To address these challenges, our work proposes ACP-DRL, a recognition method for ACPs based on deep representation learning. ACP-DRL marks an initial exploration of integrating protein language models into ACP recognition, employing in-domain further pre-training to enhance deep representation learning. It simultaneously employs bidirectional long short-term memory networks to extract amino acid features from sequences. Consequently, ACP-DRL eliminates constraints on sequence length and the dependence on manual features, showing remarkable competitiveness in comparison with existing methods.
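Before a protein language model or BiLSTM can consume a peptide, each amino acid must be mapped to an integer index, with no hard cap on sequence length. A sketch of that encoding step follows; the vocabulary ordering and special-token ids are illustrative, not those of ACP-DRL:

```python
# The 20 standard amino acids; ids 0 and 1 are reserved for padding and
# unknown residues (an illustrative convention, not ACP-DRL's actual one).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAD, UNK = 0, 1
AA_TO_ID = {aa: i + 2 for i, aa in enumerate(AMINO_ACIDS)}

def encode_peptide(seq: str) -> list[int]:
    """Variable-length integer encoding of a peptide sequence."""
    return [AA_TO_ID.get(aa, UNK) for aa in seq.upper()]

def pad_batch(seqs: list[list[int]]) -> list[list[int]]:
    """Right-pad a batch to its longest member. Padding is only needed for
    batching; it imposes no fixed maximum sequence length on the model."""
    width = max(len(s) for s in seqs)
    return [s + [PAD] * (width - len(s)) for s in seqs]

batch = pad_batch([encode_peptide("FLPIIAKLLGGLL"), encode_peptide("GLW")])
print(batch)
```

A recurrent model such as a BiLSTM then reads these index sequences in both directions, which is what frees the method from fixed-length, hand-crafted feature vectors.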
Affiliation(s)
- Xiaofang Xu
- State Key Laboratory of Medical Proteomics, Beijing Proteome Research Center, National Center for Protein Sciences (Beijing), Beijing Institute of Lifeomics, Beijing, China
- Chaoran Li
- State Key Laboratory of Medical Proteomics, Beijing Proteome Research Center, National Center for Protein Sciences (Beijing), Beijing Institute of Lifeomics, Beijing, China
- Xinpu Yuan
- Department of General Surgery, First Medical Center, Chinese PLA General Hospital, Beijing, China
- Qiangjian Zhang
- Institute of Dataspace, Hefei Comprehensive National Science Center, Hefei, China
- Yi Liu
- State Key Laboratory of Medical Proteomics, Beijing Proteome Research Center, National Center for Protein Sciences (Beijing), Beijing Institute of Lifeomics, Beijing, China
- Yunping Zhu
- State Key Laboratory of Medical Proteomics, Beijing Proteome Research Center, National Center for Protein Sciences (Beijing), Beijing Institute of Lifeomics, Beijing, China
- Tao Chen
- State Key Laboratory of Medical Proteomics, Beijing Proteome Research Center, National Center for Protein Sciences (Beijing), Beijing Institute of Lifeomics, Beijing, China
6
Beltrami EJ, Grant-Kels JM. Consulting ChatGPT: Ethical dilemmas in language model artificial intelligence. J Am Acad Dermatol 2024; 90:879-880. [PMID: 36907556 DOI: 10.1016/j.jaad.2023.02.052] [Received: 01/12/2023] [Revised: 02/14/2023] [Accepted: 02/28/2023]
Affiliation(s)
- Eric J Beltrami
- University of Connecticut School of Medicine, Farmington, CT
- Jane Margaret Grant-Kels
- Department of Dermatology, University of Connecticut Health Center, Farmington, CT; Department of Dermatology, University of Florida, Gainesville, FL.
7
Beaulieu-Jones BR, Berrigan MT, Shah S, Marwaha JS, Lai SL, Brat GA. Evaluating capabilities of large language models: Performance of GPT-4 on surgical knowledge assessments. Surgery 2024; 175:936-942. [PMID: 38246839 PMCID: PMC10947829 DOI: 10.1016/j.surg.2023.12.014] [Received: 07/17/2023] [Revised: 12/09/2023] [Accepted: 12/15/2023]
Abstract
BACKGROUND Artificial intelligence has the potential to dramatically alter health care by enhancing how we diagnose and treat disease. One promising artificial intelligence model is ChatGPT, a general-purpose large language model trained by OpenAI. ChatGPT has shown human-level performance on several professional and academic benchmarks. We sought to evaluate its performance on surgical knowledge questions and assess the stability of this performance on repeat queries. METHODS We evaluated the performance of ChatGPT-4 on questions from the Surgical Council on Resident Education question bank and a second commonly used surgical knowledge assessment, referred to as Data-B. Questions were entered in 2 formats: open-ended and multiple-choice. ChatGPT outputs were assessed for accuracy and insights by surgeon evaluators. We categorized reasons for model errors and the stability of performance on repeat queries. RESULTS A total of 167 Surgical Council on Resident Education and 112 Data-B questions were presented to the ChatGPT interface. ChatGPT correctly answered 71.3% and 67.9% of multiple choice and 47.9% and 66.1% of open-ended questions for Surgical Council on Resident Education and Data-B, respectively. For both open-ended and multiple-choice questions, approximately two-thirds of ChatGPT responses contained nonobvious insights. Common reasons for incorrect responses included inaccurate information in a complex question (n = 16, 36.4%), inaccurate information in a fact-based question (n = 11, 25.0%), and accurate information with circumstantial discrepancy (n = 6, 13.6%). Upon repeat query, the answer selected by ChatGPT varied for 36.4% of questions answered incorrectly on the first query; the response accuracy changed for 6/16 (37.5%) questions. CONCLUSION Consistent with findings in other academic and professional domains, we demonstrate near or above human-level performance of ChatGPT on surgical knowledge questions from 2 widely used question banks. 
ChatGPT performed better on multiple-choice than on open-ended questions, prompting questions about its potential for clinical application. Unique to this study, we demonstrate inconsistency in ChatGPT responses on repeat queries. This finding warrants future consideration, including efforts to train large language models to provide the safe and consistent responses required for clinical application. Despite near or above human-level performance on question banks, and given these observations, it remains unclear whether large language models such as ChatGPT can safely assist clinicians in providing care.
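The repeat-query stability analysis amounts to measuring how often the selected answer stays the same between runs. A sketch with invented toy data (the study re-queried only questions answered incorrectly on the first pass):

```python
def answer_stability(first: list[str], repeat: list[str]) -> float:
    """Fraction of questions whose selected answer stayed the same on repeat query."""
    same = sum(a == b for a, b in zip(first, repeat))
    return same / len(first)

# Toy answer letters for five re-queried questions: 3 of 5 are unchanged.
first_run = ["B", "C", "A", "D", "B"]
second_run = ["B", "A", "A", "D", "C"]
print(answer_stability(first_run, second_run))
```

A low stability score flags exactly the inconsistency the abstract warns about: a model whose answer changes between identical queries cannot yet be trusted for clinical use.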
Affiliation(s)
- Brendin R Beaulieu-Jones
- Department of Surgery, Beth Israel Deaconess Medical Center, Boston, MA; Department of Biomedical Informatics, Harvard Medical School, Boston, MA. https://twitter.com/bratogram
- Sahaj Shah
- Geisinger Commonwealth School of Medicine, Scranton, PA
- Jayson S Marwaha
- Division of Colorectal Surgery, National Taiwan University Hospital, Taipei, Taiwan
- Shuo-Lun Lai
- Division of Colorectal Surgery, National Taiwan University Hospital, Taipei, Taiwan
- Gabriel A Brat
- Department of Surgery, Beth Israel Deaconess Medical Center, Boston, MA; Department of Biomedical Informatics, Harvard Medical School, Boston, MA.
8
Klein E, Kinsella M, Stevens I, Fried-Oken M. Ethical issues raised by incorporating personalized language models into brain-computer interface communication technologies: a qualitative study of individuals with neurological disease. Disabil Rehabil Assist Technol 2024; 19:1041-1051. [PMID: 36403143 PMCID: PMC10351684 DOI: 10.1080/17483107.2022.2146217] [Received: 01/11/2022] [Revised: 09/01/2022] [Accepted: 11/07/2022]
Abstract
PURPOSE To examine the views of individuals with neurodegenerative diseases about ethical issues related to incorporating personalized language models into brain-computer interface (BCI) communication technologies. METHODS Fifteen semi-structured interviews and 51 online free response surveys were completed with individuals diagnosed with neurodegenerative disease that could lead to loss of speech and motor skills. Each participant responded to questions after six hypothetical ethics vignettes were presented that address the possibility of building language models with personal words and phrases in BCI communication technologies. Data were analyzed with consensus coding, using modified grounded theory. RESULTS Four themes were identified. (1) The experience of a neurodegenerative disease shapes preferences for personalized language models. (2) An individual's identity will be affected by the ability to personalize the language model. (3) The motivation for personalization is tied to how relationships can be helped or harmed. (4) Privacy is important to people who may need BCI communication technologies. Responses suggest that the inclusion of personal lexica raises ethical issues. Stakeholders want their values to be considered during development of BCI communication technologies. CONCLUSIONS With the rapid development of BCI communication technologies, it is critical to incorporate feedback from individuals regarding their ethical concerns about the storage and use of personalized language models. 
Stakeholder values and preferences about disability, privacy, identity, and relationships should drive design, innovation, and implementation. IMPLICATIONS FOR REHABILITATION Individuals with neurodegenerative diseases are important stakeholders to consider in the development of natural language processing within brain-computer interface (BCI) communication technologies. The incorporation of personalized language models raises issues related to disability, identity, relationships, and privacy. People who may one day rely on BCI communication technologies care not just about the usability of communication technology but about technology that supports their values and priorities. Qualitative ethics-focused research is a valuable tool for exploring stakeholder perspectives on new capabilities of BCI communication technologies, such as the storage and use of personalized language models.
Affiliation(s)
- Eran Klein
- Department of Neurology, Oregon Health & Science University, Portland, OR, USA
- Michelle Kinsella
- Institute on Development and Disability, Oregon Health & Science University, Portland, OR, USA
- Ian Stevens
- Department of Neurosurgery, Oregon Health & Science University, Portland, OR, USA
- Melanie Fried-Oken
- Department of Neurology, Oregon Health & Science University, Portland, OR, USA
- Institute on Development and Disability, Oregon Health & Science University, Portland, OR, USA
9
Kauf C, Tuckute G, Levy R, Andreas J, Fedorenko E. Lexical-Semantic Content, Not Syntactic Structure, Is the Main Contributor to ANN-Brain Similarity of fMRI Responses in the Language Network. Neurobiol Lang (Camb) 2024; 5:7-42. [PMID: 38645614 PMCID: PMC11025651 DOI: 10.1162/nol_a_00116] [Received: 12/15/2022] [Accepted: 07/11/2023]
Abstract
Representations from artificial neural network (ANN) language models have been shown to predict human brain activity in the language network. To understand which aspects of linguistic stimuli contribute to ANN-to-brain similarity, we used an fMRI data set of responses to n = 627 naturalistic English sentences (Pereira et al., 2018) and systematically manipulated the stimuli for which ANN representations were extracted. In particular, we (i) perturbed sentences' word order, (ii) removed different subsets of words, or (iii) replaced sentences with other sentences of varying semantic similarity. We found that the lexical-semantic content of the sentence (largely carried by content words), rather than the sentence's syntactic form (conveyed via word order or function words), is primarily responsible for the ANN-to-brain similarity. In follow-up analyses, we found that perturbation manipulations that adversely affect brain predictivity also lead to more divergent representations in the ANN's embedding space and decrease the ANN's ability to predict upcoming tokens in those stimuli. Further, the results are robust to whether the mapping model is trained on intact or perturbed stimuli and to whether the ANN sentence representations are conditioned on the same linguistic context that humans saw. The critical result, that lexical-semantic content is the main contributor to the similarity between ANN representations and neural ones, aligns with the idea that the goal of the human language system is to extract meaning from linguistic strings. Finally, this work highlights the strength of systematic experimental manipulations for evaluating how close we are to accurate and generalizable models of the human language network.
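The "mapping model" in analyses like this is typically a regularized linear regression from ANN sentence embeddings to voxel responses, scored by held-out correlation. A sketch with synthetic data and arbitrary dimensions (the study's actual pipeline, stimuli, and regularization choices differ):

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, n_dims, n_voxels = 400, 100, 50, 20

X = rng.normal(size=(n_train + n_test, n_dims))        # ANN sentence embeddings
W_true = rng.normal(size=(n_dims, n_voxels))
Y = X @ W_true + 0.1 * rng.normal(size=(n_train + n_test, n_voxels))  # fMRI proxy

def ridge_fit(X, Y, lam=1.0):
    """Closed-form ridge solution W = (X'X + lam*I)^-1 X'Y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

W = ridge_fit(X[:n_train], Y[:n_train])
pred = X[n_train:] @ W

# Brain predictivity: mean per-voxel correlation between predicted and
# observed held-out responses.
corrs = [np.corrcoef(pred[:, v], Y[n_train:, v])[0, 1] for v in range(n_voxels)]
print(round(float(np.mean(corrs)), 3))
```

Perturbation analyses then re-extract X from scrambled or word-subset stimuli and ask how much this held-out correlation drops, which is how lexical-semantic versus syntactic contributions are separated.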
Affiliation(s)
- Carina Kauf
- Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA, USA
- McGovern Institute for Brain Research, Massachusetts Institute of Technology, Cambridge, MA, USA
- Greta Tuckute
- Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA, USA
- McGovern Institute for Brain Research, Massachusetts Institute of Technology, Cambridge, MA, USA
- Roger Levy
- Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA, USA
- Jacob Andreas
- Computer Science & Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA
- Evelina Fedorenko
- Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA, USA
- McGovern Institute for Brain Research, Massachusetts Institute of Technology, Cambridge, MA, USA
- Program in Speech and Hearing Bioscience and Technology, Harvard University, Cambridge, MA, USA
10
Antonello R, Huth A. Predictive Coding or Just Feature Discovery? An Alternative Account of Why Language Models Fit Brain Data. Neurobiol Lang (Camb) 2024; 5:64-79. [PMID: 38645616 PMCID: PMC11025645 DOI: 10.1162/nol_a_00087] [Received: 02/28/2022] [Accepted: 10/26/2022]
Abstract
Many recent studies have shown that representations drawn from neural network language models are extremely effective at predicting brain responses to natural language. But why do these models work so well? One proposed explanation is that language models and brains are similar because they have the same objective: to predict upcoming words before they are perceived. This explanation is attractive because it lends support to the popular theory of predictive coding. We provide several analyses that cast doubt on this claim. First, we show that the ability to predict future words does not uniquely (or even best) explain why some representations are a better match to the brain than others. Second, we show that within a language model, representations that are best at predicting future words are strictly worse brain models than other representations. Finally, we argue in favor of an alternative explanation for the success of language models in neuroscience: These models are effective at predicting brain responses because they generally capture a wide variety of linguistic phenomena.
Affiliation(s)
- Richard Antonello
- Department of Computer Science, University of Texas at Austin, Austin, TX, USA
- Alexander Huth
- Department of Computer Science, University of Texas at Austin, Austin, TX, USA
11
Noda M, Ueno T, Koshu R, Takaso Y, Shimada MD, Saito C, Sugimoto H, Fushiki H, Ito M, Nomura A, Yoshizaki T. Performance of GPT-4V in Answering the Japanese Otolaryngology Board Certification Examination Questions: Evaluation Study. JMIR Med Educ 2024; 10:e57054. [PMID: 38546736 PMCID: PMC11009855 DOI: 10.2196/57054] [Received: 02/03/2024] [Revised: 02/22/2024] [Accepted: 03/09/2024]
Abstract
BACKGROUND Artificial intelligence models can learn from medical literature and clinical cases and generate answers that rival human experts. However, challenges remain in the analysis of complex data containing images and diagrams. OBJECTIVE This study aims to assess the answering capabilities and accuracy of ChatGPT-4 Vision (GPT-4V) on a set of 100 questions, including image-based questions, from the 2023 otolaryngology board certification examination. METHODS Answers to 100 questions from the 2023 otolaryngology board certification examination, including image-based questions, were generated using GPT-4V. The accuracy rate was evaluated using different prompts, and the effects of image presence, the clinical area of the questions, and variations in answer content were examined. RESULTS The accuracy rate for text-only input was, on average, 24.7% but improved to 47.3% with the addition of English translation and prompts (P<.001). The average nonresponse rate for text-only input was 46.3%; this decreased to 2.7% with the addition of English translation and prompts (P<.001). The accuracy rate was lower for image-based questions than for text-only questions across all types of input, with a relatively high nonresponse rate. General questions and questions from the fields of head and neck allergies and nasal allergies had relatively high accuracy rates, which increased with the addition of translation and prompts. In terms of content, questions related to anatomy had the highest accuracy rate, and for all content types, the addition of translation and prompts increased accuracy. For image-based questions, the average correct answer rate was 30.4% with text-only input and 41.3% with text-plus-image input (P=.02). CONCLUSIONS Examining artificial intelligence's answering capabilities on the otolaryngology board certification examination improves our understanding of its potential and limitations in this field. 
Although accuracy improved with the addition of translation and prompts, it remained lower for image-based questions than for text-based questions, suggesting room for improvement in GPT-4V at this stage. Furthermore, text-plus-image input yielded a higher rate of correct answers on image-based questions than text-only input. Our findings point to the usefulness and potential of GPT-4V in medicine; however, methods for its safe use require further consideration.
Affiliation(s)
- Masao Noda
- Department of Otolaryngology and Head and Neck Surgery, Jichi Medical University, Shimotsuke, Japan
- Department of Otolaryngology and Head and Neck Surgery, Kanazawa University, Kanazawa, Japan
- Takayoshi Ueno
- Department of Otolaryngology and Head and Neck Surgery, Kanazawa University, Kanazawa, Japan
- Ryota Koshu
- Department of Otolaryngology and Head and Neck Surgery, Jichi Medical University, Shimotsuke, Japan
- Yuji Takaso
- Department of Otolaryngology and Head and Neck Surgery, Jichi Medical University, Shimotsuke, Japan
- Department of Otolaryngology and Head and Neck Surgery, Kanazawa University, Kanazawa, Japan
- Mari Dias Shimada
- Department of Otolaryngology and Head and Neck Surgery, Jichi Medical University, Shimotsuke, Japan
- Chizu Saito
- Department of Otolaryngology and Head and Neck Surgery, Jichi Medical University, Shimotsuke, Japan
- Hisashi Sugimoto
- Department of Otolaryngology and Head and Neck Surgery, Kanazawa University, Kanazawa, Japan
- Hiroaki Fushiki
- Department of Otolaryngology, Mejiro University Ear Institute Clinic, Saitama, Japan
- Makoto Ito
- Department of Otolaryngology and Head and Neck Surgery, Jichi Medical University, Shimotsuke, Japan
- Akihiro Nomura
- College of Transdisciplinary Sciences for Innovation, Kanazawa University, Kanazawa, Japan
- Tomokazu Yoshizaki
- Department of Otolaryngology and Head and Neck Surgery, Kanazawa University, Kanazawa, Japan
12
Elyoseph Z, Levkovich I. Comparing the Perspectives of Generative AI, Mental Health Experts, and the General Public on Schizophrenia Recovery: Case Vignette Study. JMIR Ment Health 2024; 11:e53043. [PMID: 38533615 PMCID: PMC11004608 DOI: 10.2196/53043] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 09/24/2023] [Revised: 01/24/2024] [Accepted: 02/11/2024] [Indexed: 03/28/2024] Open
Abstract
Background The current paradigm in mental health care focuses on clinical recovery and symptom remission. This model's efficacy is influenced by therapist trust in patient recovery potential and the depth of the therapeutic relationship. Schizophrenia is a chronic illness with severe symptoms where the possibility of recovery is a matter of debate. As artificial intelligence (AI) becomes integrated into the health care field, it is important to examine its ability to assess recovery potential in major psychiatric disorders such as schizophrenia. Objective This study aimed to evaluate the ability of large language models (LLMs), in comparison to mental health professionals, to assess the prognosis of schizophrenia with and without professional treatment and the long-term positive and negative outcomes. Methods Vignettes were input into the interfaces of 4 AI platforms (ChatGPT-3.5, ChatGPT-4, Google Bard, and Claude) and assessed 10 times by each. A total of 80 evaluations were collected and benchmarked against existing norms to analyze what mental health professionals (general practitioners, psychiatrists, clinical psychologists, and mental health nurses) and the general public think about schizophrenia prognosis with and without professional treatment and the positive and negative long-term outcomes of schizophrenia interventions. Results For the prognosis of schizophrenia with professional treatment, ChatGPT-3.5 was notably pessimistic, whereas ChatGPT-4, Claude, and Bard aligned with professional views but differed from the general public. All LLMs predicted that, without professional treatment, schizophrenia would remain static or worsen. For long-term outcomes, ChatGPT-4 and Claude predicted more negative outcomes than Bard and ChatGPT-3.5. For positive outcomes, ChatGPT-3.5 and Claude were more pessimistic than Bard and ChatGPT-4.
Conclusions The finding that 3 out of the 4 LLMs aligned closely with the predictions of mental health professionals under the "with treatment" condition demonstrates the potential of this technology in providing professional clinical prognoses. The pessimistic assessment by ChatGPT-3.5 is a disturbing finding, since it may reduce the motivation of patients to start or persist with treatment for schizophrenia. Overall, although LLMs hold promise in augmenting health care, their application necessitates rigorous validation and a harmonious blend with human expertise.
Affiliation(s)
- Zohar Elyoseph
- Department of Brain Sciences, Faculty of Medicine, Imperial College London, London, United Kingdom
- The Center for Psychobiological Research, Department of Psychology and Educational Counseling, Max Stern Yezreel Valley College, Emek Yezreel, Israel
- Inbar Levkovich
- Faculty of Graduate Studies, Oranim Academic College, Kiryat Tiv'on, Israel
13
Nakao T, Miki S, Nakamura Y, Kikuchi T, Nomura Y, Hanaoka S, Yoshikawa T, Abe O. Capability of GPT-4V(ision) in the Japanese National Medical Licensing Examination: Evaluation Study. JMIR Med Educ 2024; 10:e54393. [PMID: 38470459 DOI: 10.2196/54393] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 11/08/2023] [Revised: 12/26/2023] [Accepted: 02/16/2024] [Indexed: 03/13/2024]
Abstract
BACKGROUND Previous research applying large language models (LLMs) to medicine was focused on text-based information. Recently, multimodal variants of LLMs acquired the capability of recognizing images. OBJECTIVE We aim to evaluate the image recognition capability of generative pretrained transformer (GPT)-4V, a recent multimodal LLM developed by OpenAI, in the medical field by testing how visual information affects its performance in answering questions from the 117th Japanese National Medical Licensing Examination. METHODS We focused on 108 questions that had 1 or more images as part of a question and presented GPT-4V with the same questions under two conditions: (1) with both the question text and associated images and (2) with the question text only. We then compared the difference in accuracy between the 2 conditions using the exact McNemar test. RESULTS Among the 108 questions with images, GPT-4V's accuracy was 68% (73/108) when presented with images and 72% (78/108) when presented without images (P=.36). For the 2 question categories, clinical and general, the accuracies with and those without images were 71% (70/98) versus 78% (76/98; P=.21) and 30% (3/10) versus 20% (2/10; P≥.99), respectively. CONCLUSIONS The additional information from the images did not significantly improve the performance of GPT-4V in the Japanese National Medical Licensing Examination.
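The exact McNemar test used in this study compares paired correct/incorrect outcomes using only the discordant pairs (questions answered correctly under one condition but not the other). A minimal stdlib-only sketch of the two-sided exact version; the function name and example counts are illustrative, not taken from the paper:

```python
from math import comb

def exact_mcnemar_p(b: int, c: int) -> float:
    """Two-sided exact McNemar test.

    b and c are the discordant-pair counts. Under the null
    hypothesis their split follows Binomial(b + c, 0.5), so the
    p-value is twice the smaller binomial tail, capped at 1.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence either way
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Perfectly balanced discordant pairs give no evidence of a difference.
print(exact_mcnemar_p(5, 5))  # 1.0
```

Only the off-diagonal counts matter; questions answered the same way under both conditions carry no information about which condition is better.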
Affiliation(s)
- Takahiro Nakao
- Department of Computational Diagnostic Radiology and Preventive Medicine, The University of Tokyo Hospital, Bunkyo-ku, Tokyo, Japan
- Soichiro Miki
- Department of Computational Diagnostic Radiology and Preventive Medicine, The University of Tokyo Hospital, Bunkyo-ku, Tokyo, Japan
- Yuta Nakamura
- Department of Computational Diagnostic Radiology and Preventive Medicine, The University of Tokyo Hospital, Bunkyo-ku, Tokyo, Japan
- Tomohiro Kikuchi
- Department of Computational Diagnostic Radiology and Preventive Medicine, The University of Tokyo Hospital, Bunkyo-ku, Tokyo, Japan
- Department of Radiology, School of Medicine, Jichi Medical University, Shimotsuke, Tochigi, Japan
- Yukihiro Nomura
- Department of Computational Diagnostic Radiology and Preventive Medicine, The University of Tokyo Hospital, Bunkyo-ku, Tokyo, Japan
- Center for Frontier Medical Engineering, Chiba University, Inage-ku, Chiba, Japan
- Shouhei Hanaoka
- Department of Radiology, The University of Tokyo Hospital, Bunkyo-ku, Tokyo, Japan
- Takeharu Yoshikawa
- Department of Computational Diagnostic Radiology and Preventive Medicine, The University of Tokyo Hospital, Bunkyo-ku, Tokyo, Japan
- Osamu Abe
- Department of Radiology, The University of Tokyo Hospital, Bunkyo-ku, Tokyo, Japan
14
Davis J, Van Bulck L, Durieux BN, Lindvall C. The Temperature Feature of ChatGPT: Modifying Creativity for Clinical Research. JMIR Hum Factors 2024; 11:e53559. [PMID: 38457221 PMCID: PMC10960206 DOI: 10.2196/53559] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/11/2023] [Revised: 12/11/2023] [Accepted: 01/24/2024] [Indexed: 03/09/2024] Open
Abstract
More clinicians and researchers are exploring uses for large language model chatbots, such as ChatGPT, for research, dissemination, and educational purposes. Therefore, it becomes increasingly relevant to consider the full potential of this tool, including the special features that are currently available through the application programming interface. One of these features is a variable called temperature, which changes the degree to which randomness is involved in the model's generated output. This is of particular interest to clinicians and researchers. By lowering this variable, one can generate more consistent outputs; by increasing it, one can receive more creative responses. For clinicians and researchers who are exploring these tools for a variety of tasks, the ability to tailor outputs to be less creative may be beneficial for work that demands consistency. Additionally, access to more creative text generation may enable scientific authors to describe their research in more general language and potentially connect with a broader public through social media. In this viewpoint, we present the temperature feature, discuss potential uses, and provide some examples.
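Under the hood, temperature rescales the model's logits before the softmax that turns them into token probabilities: low values concentrate probability mass on the top token (more consistent output), high values flatten the distribution (more varied output). A minimal illustration with made-up logits:

```python
from math import exp

def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then normalize to probabilities."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
cold = softmax_with_temperature(logits, temperature=0.2)  # nearly deterministic
hot = softmax_with_temperature(logits, temperature=2.0)   # closer to uniform
print(max(cold) > max(hot))  # True
```

In the OpenAI API this behavior is exposed through the `temperature` request parameter; as the authors note, lower values suit tasks that demand consistency, while higher values yield more creative text.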
Affiliation(s)
- Joshua Davis
- Department of Psychosocial Oncology and Palliative Care, Dana-Farber Cancer Institute, Boston, MA, United States
- Albany Medical College, Albany, NY, United States
- Liesbet Van Bulck
- KU Leuven Department of Public Health and Primary Care, KU Leuven-University of Leuven, Leuven, Belgium
- Research Foundation Flanders (FWO), Brussels, Belgium
- Brigitte N Durieux
- Department of Psychosocial Oncology and Palliative Care, Dana-Farber Cancer Institute, Boston, MA, United States
- Charlotta Lindvall
- Department of Psychosocial Oncology and Palliative Care, Dana-Farber Cancer Institute, Boston, MA, United States
- Department of Medicine, Brigham and Women's Hospital, Boston, MA, United States
- Harvard Medical School, Harvard University, Boston, MA, United States
15
Reynolds K, Tejasvi T. Potential Use of ChatGPT in Responding to Patient Questions and Creating Patient Resources. JMIR Dermatol 2024; 7:e48451. [PMID: 38446541 PMCID: PMC10955382 DOI: 10.2196/48451] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/24/2023] [Revised: 08/11/2023] [Accepted: 02/22/2024] [Indexed: 03/07/2024] Open
Abstract
ChatGPT (OpenAI) is an artificial intelligence-based free natural language processing model that generates complex responses to user-generated prompts. The advent of this tool comes at a time when physician burnout is at an all-time high, which is attributed at least in part to time spent outside of the patient encounter within the electronic medical record (documenting the encounter, responding to patient messages, etc). Although ChatGPT is not specifically designed to provide medical information, it can generate preliminary responses to patients' questions about their medical conditions and can rapidly create educational patient resources, which inevitably require rigorous editing and fact-checking on the part of the health care provider to ensure accuracy. In this way, this assistive technology has the potential to not only enhance a physician's efficiency and work-life balance but also enrich the patient-physician relationship and ultimately improve patient outcomes.
Affiliation(s)
- Kelly Reynolds
- Department of Dermatology, University of Michigan, Ann Arbor, MI, United States
- Trilokraj Tejasvi
- Department of Dermatology, University of Michigan, Ann Arbor, MI, United States
16
Roster K, Kann RB, Farabi B, Gronbeck C, Brownstone N, Lipner SR. Readability and Health Literacy Scores for ChatGPT-Generated Dermatology Public Education Materials: Cross-Sectional Analysis of Sunscreen and Melanoma Questions. JMIR Dermatol 2024; 7:e50163. [PMID: 38446502 PMCID: PMC10955394 DOI: 10.2196/50163] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/21/2023] [Revised: 01/02/2024] [Accepted: 02/06/2024] [Indexed: 03/07/2024] Open
Affiliation(s)
- Katie Roster
- New York Medical College, New York, NY, United States
- Banu Farabi
- Dermatology Department, NYC Health + Hospital/Metropolitan, New York, NY, United States
- Christian Gronbeck
- Department of Dermatology, University of Connecticut Health Center, Farmington, CT, United States
- Nicholas Brownstone
- Department of Dermatology, Temple University Hospital, Philadelphia, PA, United States
- Shari R Lipner
- Department of Dermatology, Weill Cornell Medicine, New York, NY, United States
17
Rodriguez DV, Lawrence K, Gonzalez J, Brandfield-Harvey B, Xu L, Tasneem S, Levine DL, Mann D. Leveraging Generative AI Tools to Support the Development of Digital Solutions in Health Care Research: Case Study. JMIR Hum Factors 2024; 11:e52885. [PMID: 38446539 PMCID: PMC10955400 DOI: 10.2196/52885] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/18/2023] [Revised: 11/27/2023] [Accepted: 12/15/2023] [Indexed: 03/07/2024] Open
Abstract
BACKGROUND Generative artificial intelligence has the potential to revolutionize health technology product development by improving coding quality, efficiency, documentation, quality assessment and review, and troubleshooting. OBJECTIVE This paper explores the application of a commercially available generative artificial intelligence tool (ChatGPT) to the development of a digital health behavior change intervention designed to support patient engagement in a commercial digital diabetes prevention program. METHODS We examined the capacity, advantages, and limitations of ChatGPT to support digital product idea conceptualization, intervention content development, and the software engineering process, including software requirement generation, software design, and code production. In total, 11 evaluators, each with at least 10 years of experience in fields of study ranging from medicine and implementation science to computer science, participated in the output review process (ChatGPT vs human-generated output). All had familiarity or prior exposure to the original personalized automatic messaging system intervention. The evaluators rated the ChatGPT-produced outputs in terms of understandability, usability, novelty, relevance, completeness, and efficiency. RESULTS Most metrics received positive scores. We identified that ChatGPT can (1) support developers to achieve high-quality products faster and (2) facilitate nontechnical communication and system understanding between technical and nontechnical team members around the development goal of rapid and easy-to-build computational solutions for medical technologies. CONCLUSIONS ChatGPT can serve as a usable facilitator for researchers engaging in the software development life cycle, from product conceptualization to feature identification and user story development to code generation. TRIAL REGISTRATION ClinicalTrials.gov NCT04049500; https://clinicaltrials.gov/ct2/show/NCT04049500.
Affiliation(s)
- Danissa V Rodriguez
- Department of Population Health, New York University Grossman School of Medicine, New York, NY, United States
- Katharine Lawrence
- Department of Population Health, New York University Grossman School of Medicine, New York, NY, United States
- Medical Center Information Technology, Department of Health Informatics, New York University Langone Health, New York, NY, United States
- Javier Gonzalez
- Medical Center Information Technology, Department of Health Informatics, New York University Langone Health, New York, NY, United States
- Beatrix Brandfield-Harvey
- Department of Population Health, New York University Grossman School of Medicine, New York, NY, United States
- Lynn Xu
- Department of Population Health, New York University Grossman School of Medicine, New York, NY, United States
- Sumaiya Tasneem
- Department of Population Health, New York University Grossman School of Medicine, New York, NY, United States
- Defne L Levine
- Department of Population Health, New York University Grossman School of Medicine, New York, NY, United States
- Devin Mann
- Department of Population Health, New York University Grossman School of Medicine, New York, NY, United States
- Medical Center Information Technology, Department of Health Informatics, New York University Langone Health, New York, NY, United States
18
Goldstein JA, Chao J, Grossman S, Stamos A, Tomz M. How persuasive is AI-generated propaganda? PNAS Nexus 2024; 3:pgae034. [PMID: 38380055 PMCID: PMC10878360 DOI: 10.1093/pnasnexus/pgae034] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 08/18/2023] [Accepted: 01/03/2024] [Indexed: 02/22/2024]
Abstract
Can large language models, a form of artificial intelligence (AI), generate persuasive propaganda? We conducted a preregistered survey experiment of US respondents to investigate the persuasiveness of news articles written by foreign propagandists compared to content generated by GPT-3 davinci (a large language model). We found that GPT-3 can create highly persuasive text as measured by participants' agreement with propaganda theses. We further investigated whether a person fluent in English could improve propaganda persuasiveness. Editing the prompt fed to GPT-3 and/or curating GPT-3's output made GPT-3 even more persuasive, and, under certain conditions, as persuasive as the original propaganda. Our findings suggest that propagandists could use AI to create convincing content with limited effort.
Affiliation(s)
- Josh A Goldstein
- Center for Security and Emerging Technology, Georgetown University, Washington, DC 20001, USA
- Jason Chao
- Stanford Internet Observatory, Stanford University, Stanford, CA 94305, USA
- Shelby Grossman
- Stanford Internet Observatory, Stanford University, Stanford, CA 94305, USA
- Alex Stamos
- Stanford Internet Observatory, Stanford University, Stanford, CA 94305, USA
- Michael Tomz
- Department of Political Science and Stanford Institute for Economic Policy Research, Stanford University, Stanford, CA 94305, USA
19
Ligeti B, Szepesi-Nagy I, Bodnár B, Ligeti-Nagy N, Juhász J. ProkBERT family: genomic language models for microbiome applications. Front Microbiol 2024; 14:1331233. [PMID: 38282738 PMCID: PMC10810988 DOI: 10.3389/fmicb.2023.1331233] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/31/2023] [Accepted: 12/11/2023] [Indexed: 01/30/2024] Open
Abstract
Background In the evolving landscape of microbiology and microbiome analysis, the integration of machine learning is crucial for understanding complex microbial interactions, and predicting and recognizing novel functionalities within extensive datasets. However, the effectiveness of these methods in microbiology faces challenges due to the complex and heterogeneous nature of microbial data, further complicated by low signal-to-noise ratios, context-dependency, and a significant shortage of appropriately labeled datasets. This study introduces the ProkBERT model family, a collection of large language models, designed for genomic tasks. It provides a generalizable sequence representation for nucleotide sequences, learned from unlabeled genome data. This approach helps overcome the above-mentioned limitations in the field, thereby improving our understanding of microbial ecosystems and their impact on health and disease. Methods ProkBERT models are based on transfer learning and self-supervised methodologies, enabling them to use the abundant yet complex microbial data effectively. The introduction of the novel Local Context-Aware (LCA) tokenization technique marks a significant advancement, allowing ProkBERT to overcome the contextual limitations of traditional transformer models. This methodology not only retains rich local context but also demonstrates remarkable adaptability across various bioinformatics tasks. Results In practical applications such as promoter prediction and phage identification, the ProkBERT models show superior performance. For promoter prediction tasks, the top-performing model achieved a Matthews Correlation Coefficient (MCC) of 0.74 for E. coli and 0.62 in mixed-species contexts. In phage identification, ProkBERT models consistently outperformed established tools like VirSorter2 and DeepVirFinder, achieving an MCC of 0.85. 
These results underscore the models' exceptional accuracy and generalizability in both supervised and unsupervised tasks. Conclusions The ProkBERT model family is a compact yet powerful tool in the field of microbiology and bioinformatics. Its capacity for rapid, accurate analyses and its adaptability across a spectrum of tasks marks a significant advancement in machine learning applications in microbiology. The models are available on GitHub (https://github.com/nbrg-ppcu/prokbert) and HuggingFace (https://huggingface.co/nerualbioinfo) providing an accessible tool for the community.
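The Matthews Correlation Coefficient reported in these benchmarks summarizes a binary confusion matrix as a single value in [-1, 1], robust to class imbalance. A small reference implementation; the example counts are illustrative, not from the paper:

```python
from math import sqrt

def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews Correlation Coefficient from binary confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # conventional value when a row or column is empty
    return (tp * tn - fp * fn) / denom

print(mcc(tp=50, tn=50, fp=0, fn=0))  # 1.0 for perfect prediction
```

An MCC of 1 indicates perfect agreement, 0 no better than chance, and -1 total disagreement, which makes the reported 0.74-0.85 range interpretable across tasks with different class balances.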
Affiliation(s)
- Balázs Ligeti
- Faculty of Information Technology and Bionics, Pázmány Péter Catholic University, Budapest, Hungary
- István Szepesi-Nagy
- Faculty of Information Technology and Bionics, Pázmány Péter Catholic University, Budapest, Hungary
- Babett Bodnár
- Faculty of Information Technology and Bionics, Pázmány Péter Catholic University, Budapest, Hungary
- Noémi Ligeti-Nagy
- Language Technology Research Group, HUN-REN Hungarian Research Centre for Linguistics, Budapest, Hungary
- János Juhász
- Faculty of Information Technology and Bionics, Pázmány Péter Catholic University, Budapest, Hungary
- Institute of Medical Microbiology, Semmelweis University, Budapest, Hungary
20
Dergaa I, Fekih-Romdhane F, Hallit S, Loch AA, Glenn JM, Fessi MS, Ben Aissa M, Souissi N, Guelmami N, Swed S, El Omri A, Bragazzi NL, Ben Saad H. ChatGPT is not ready yet for use in providing mental health assessment and interventions. Front Psychiatry 2024; 14:1277756. [PMID: 38239905 PMCID: PMC10794665 DOI: 10.3389/fpsyt.2023.1277756] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 08/15/2023] [Accepted: 11/17/2023] [Indexed: 01/22/2024] Open
Abstract
Background Psychiatry is a specialized field of medicine that focuses on the diagnosis, treatment, and prevention of mental health disorders. With advancements in technology and the rise of artificial intelligence (AI), there has been a growing interest in exploring the potential of AI language model systems, such as Chat Generative Pre-training Transformer (ChatGPT), to assist in the field of psychiatry. Objective Our study aimed to evaluate the effectiveness, reliability, and safety of ChatGPT in assisting patients with mental health problems, and to assess its potential as a collaborative tool for mental health professionals through a simulated interaction with three distinct imaginary patients. Methods Three imaginary patient scenarios (cases A, B, and C) were created, representing different mental health problems. All three patients present with, and seek to eliminate, the same chief complaint (i.e., difficulty falling asleep and waking up frequently during the night in the last 2 weeks). ChatGPT was engaged as a virtual psychiatric assistant to provide responses and treatment recommendations. Results In case A, the recommendations were relatively appropriate (albeit non-specific), and could potentially be beneficial for both users and clinicians. However, as the complexity of the clinical cases increased (cases B and C), the information and recommendations generated by ChatGPT became inappropriate, even dangerous; and the limitations of the program became more glaring. The main strengths of ChatGPT lie in its ability to provide quick responses to user queries and to simulate empathy. One notable limitation is ChatGPT's inability to interact with users to collect further information relevant to the diagnosis and management of a patient's clinical condition. Another serious limitation is ChatGPT's inability to use critical thinking and clinical judgment to guide patient management.
Conclusion As of July 2023, ChatGPT failed to give simple medical advice in certain clinical scenarios. This suggests that the quality of ChatGPT-generated content is still far from sufficient to guide users and professionals in providing accurate mental health information. It therefore remains premature to draw conclusions about the usefulness and safety of ChatGPT in mental health practice.
Affiliation(s)
- Ismail Dergaa
- Primary Health Care Corporation (PHCC), Doha, Qatar
- Research Unit Physical Activity, Sport, and Health, UR18JS01, National Observatory of Sport, Tunis, Tunisia
- High Institute of Sport and Physical Education, University of Sfax, Sfax, Tunisia
- Feten Fekih-Romdhane
- The Tunisian Center of Early Intervention in Psychosis, Department of Psychiatry “Ibn Omrane”, Razi Hospital, Manouba, Tunisia
- Faculty of Medicine of Tunis, Tunis El Manar University, Tunis, Tunisia
- Souheil Hallit
- School of Medicine and Medical Sciences, Holy Spirit University of Kaslik, Jounieh, Lebanon
- Psychology Department, College of Humanities, Effat University, Jeddah, Saudi Arabia
- Applied Science Research Center, Applied Science Private University, Amman, Jordan
- Alexandre Andrade Loch
- Laboratorio de Neurociencias (LIM 27), Hospital das Clínicas HCFMUSP, Faculdade de Medicina, Instituto de Psiquiatria, Universidade de Sao Paulo, São Paulo, Brazil
- Instituto Nacional de Biomarcadores em Neuropsiquiatria (INBION), Conselho Nacional de Desenvolvimento Científico e Tecnológico, São Paulo, Brazil
- Mohamed Ben Aissa
- Department of Human and Social Sciences, Higher Institute of Sport and Physical Education of Kef, University of Jendouba, Jendouba, Tunisia
- Nizar Souissi
- Research Unit Physical Activity, Sport, and Health, UR18JS01, National Observatory of Sport, Tunis, Tunisia
- Noomen Guelmami
- Department of Health Sciences (DISSAL), Postgraduate School of Public Health, University of Genoa, Genoa, Italy
- Sarya Swed
- Faculty of Medicine, Aleppo University, Aleppo, Syria
- Abdelfatteh El Omri
- Surgical Research Section, Department of Surgery, Hamad Medical Corporation, Doha, Qatar
- Nicola Luigi Bragazzi
- Laboratory for Industrial and Applied Mathematics, Department of Mathematics and Statistics, York University, Toronto, ON, Canada
- Helmi Ben Saad
- Service of Physiology and Functional Explorations, Farhat HACHED Hospital, University of Sousse, Sousse, Tunisia
- Heart Failure (LR12SP09) Research Laboratory, Farhat HACHED Hospital, University of Sousse, Sousse, Tunisia
21
Bajorath J. Chemical language models for molecular design. Mol Inform 2024; 43:e202300288. [PMID: 38010610 DOI: 10.1002/minf.202300288] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/21/2023] [Revised: 11/22/2023] [Accepted: 11/23/2023] [Indexed: 11/29/2023]
Abstract
In drug discovery, chemical language models (CLMs) originating from natural language processing offer new opportunities for molecular design. CLMs have been developed using recurrent neural network (RNN) or transformer architectures. For the predictive performance of RNN-based encoder-decoder frameworks and transformers, attention mechanisms play a central role. Among others, emerging application areas for CLMs include constrained generative modeling and the prediction of chemical reactions or drug-target interactions. Since CLMs are applicable to any compound or target data that can be presented in a sequential format and tokenized, mappings of different types of sequences can be learned. For example, active compounds can be predicted from protein sequence motifs. Novel off-the-beaten-path applications can also be considered. For example, analogue series from medicinal chemistry can be perceived and represented as chemical sequences and extended with new compounds using CLMs. Herein, methodological features of CLMs and different applications are discussed.
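The tokenization step CLMs depend on can be illustrated for SMILES strings, the most common chemical "language". The regex below is a deliberately pared-down sketch for illustration, not an actual CLM vocabulary:

```python
import re

# Multi-character tokens (bracket atoms, Cl/Br) must appear before
# single-letter alternatives so the regex matches them first.
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]"          # bracket atoms such as [nH] or [O-]
    r"|Br|Cl"              # two-letter organic-subset atoms
    r"|[BCNOPSFIbcnops]"   # one-letter atoms (lowercase = aromatic)
    r"|[=#$/\\().+\-]"     # bonds, branches, charges, dots
    r"|%\d{2}|\d"          # ring-closure digits
)

def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into chemically meaningful tokens."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"untokenizable characters in {smiles!r}")
    return tokens

print(tokenize_smiles("CC(=O)Cl"))  # ['C', 'C', '(', '=', 'O', ')', 'Cl']
```

Once molecules are tokenized this way, the resulting sequences can be fed to the same RNN or transformer machinery used for natural language.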
Affiliation(s)
- Jürgen Bajorath
- Department of Life Science Informatics, Bonn-Aachen International Center for Information Technology, Rheinische Friedrich-Wilhelms-Universität Bonn, Friedrich-Hirzebruch-Allee 5/6, D-53115, Bonn, Germany
- Lamarr Institute for Machine Learning and Artificial Intelligence, Rheinische Friedrich-Wilhelms-Universität Bonn, Friedrich-Hirzebruch-Allee 5/6, D-53115, Bonn, Germany
22
Cheng SL, Tsai SJ, Bai YM, Ko CH, Hsu CW, Yang FC, Tsai CK, Tu YK, Yang SN, Tseng PT, Hsu TW, Liang CS, Su KP. Comparisons of Quality, Correctness, and Similarity Between ChatGPT-Generated and Human-Written Abstracts for Basic Research: Cross-Sectional Study. J Med Internet Res 2023; 25:e51229. [PMID: 38145486 PMCID: PMC10760418 DOI: 10.2196/51229] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/25/2023] [Revised: 10/17/2023] [Accepted: 11/20/2023] [Indexed: 12/26/2023] Open
Abstract
BACKGROUND ChatGPT may act as a research assistant to help organize the direction of thinking and summarize research findings. However, few studies have examined the quality, similarity (abstracts being similar to the original one), and accuracy of the abstracts generated by ChatGPT when researchers provide full-text basic research papers. OBJECTIVE We aimed to assess the applicability of an artificial intelligence (AI) model in generating abstracts for basic preclinical research. METHODS We selected 30 basic research papers from Nature, Genome Biology, and Biological Psychiatry. Excluding abstracts, we inputted the full text into ChatPDF, an application of a language model based on ChatGPT, and we prompted it to generate abstracts with the same style as used in the original papers. A total of 8 experts were invited to evaluate the quality of these abstracts (based on a Likert scale of 0-10) and identify which abstracts were generated by ChatPDF, using a blind approach. These abstracts were also evaluated for their similarity to the original abstracts and the accuracy of the AI content. RESULTS The quality of ChatGPT-generated abstracts was lower than that of the actual abstracts (10-point Likert scale: mean 4.72, SD 2.09 vs mean 8.09, SD 1.03; P<.001). The difference in quality was significant in the unstructured format (mean difference -4.33; 95% CI -4.79 to -3.86; P<.001) but minimal in the 4-subheading structured format (mean difference -2.33; 95% CI -2.79 to -1.86). Among the 30 ChatGPT-generated abstracts, 3 showed wrong conclusions, and 10 were identified as AI content. The mean percentage of similarity between the original and the generated abstracts was not high (2.10%-4.40%). The blinded reviewers achieved a 93% (224/240) accuracy rate in guessing which abstracts were written using ChatGPT. CONCLUSIONS Using ChatGPT to generate a scientific abstract may not lead to issues of similarity when using real full texts written by humans. 
However, the quality of the ChatGPT-generated abstracts was suboptimal, and their accuracy was not 100%.
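The reported similarity scores (2.10%-4.40%) quantify textual overlap between each generated abstract and the original. The paper does not state which tool produced them; as a rough illustration only, a sequence-matching ratio from Python's standard library yields a comparable percentage:

```python
from difflib import SequenceMatcher

def similarity_pct(original: str, generated: str) -> float:
    """Percentage of matching text between two abstracts (illustrative;
    the study likely used dedicated plagiarism-detection software)."""
    return 100 * SequenceMatcher(None, original.lower(), generated.lower()).ratio()

orig = "The quality of generated abstracts was lower than the originals."
gen = "Generated abstracts scored below human-written ones on quality."
print(f"similarity: {similarity_pct(orig, gen):.1f}%")
```

A low ratio, as in the study, indicates the model paraphrased rather than copied the source text.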
Affiliation(s)
- Shu-Li Cheng: Department of Nursing, Mackay Medical College, Taipei, Taiwan
- Shih-Jen Tsai: Department of Psychiatry, Taipei Veterans General Hospital, Taipei, Taiwan; Division of Psychiatry, School of Medicine, National Yang-Ming University, Taipei, Taiwan
- Ya-Mei Bai: Department of Psychiatry, Taipei Veterans General Hospital, Taipei, Taiwan; Division of Psychiatry, School of Medicine, National Yang-Ming University, Taipei, Taiwan
- Chih-Hung Ko: Department of Psychiatry, Kaohsiung Medical University Hospital, Kaohsiung, Taiwan; Department of Psychiatry, College of Medicine, Kaohsiung Medical University, Kaohsiung, Taiwan; Department of Psychiatry, Kaohsiung Municipal Siaogang Hospital, Kaohsiung Medical University, Kaohsiung, Taiwan
- Chih-Wei Hsu: Department of Psychiatry, Kaohsiung Chang Gung Memorial Hospital, Kaohsiung, Taiwan
- Fu-Chi Yang: Department of Neurology, Tri-Service General Hospital, National Defense Medical Center, Taipei, Taiwan
- Chia-Kuang Tsai: Department of Neurology, Tri-Service General Hospital, National Defense Medical Center, Taipei, Taiwan
- Yu-Kang Tu: Institute of Epidemiology and Preventive Medicine, College of Public Health, National Taiwan University, Taipei, Taiwan; Department of Dentistry, National Taiwan University Hospital, Taipei, Taiwan
- Szu-Nian Yang: Department of Psychiatry, Tri-Service Hospital, Beitou Branch, Taipei, Taiwan; Department of Psychiatry, Armed Forces Taoyuan General Hospital, Taoyuan, Taiwan; Graduate Institute of Health and Welfare Policy, National Yang Ming Chiao Tung University, Taipei, Taiwan
- Ping-Tao Tseng: Institute of Biomedical Sciences, Institute of Precision Medicine, National Sun Yat-sen University, Kaohsiung, Taiwan; Department of Psychology, College of Medical and Health Science, Asia University, Taichung, Taiwan; Prospect Clinic for Otorhinolaryngology and Neurology, Kaohsiung, Taiwan
- Tien-Wei Hsu: Department of Psychiatry, E-Da Dachang Hospital, I-Shou University, Kaohsiung, Taiwan; Department of Psychiatry, E-Da Hospital, I-Shou University, Kaohsiung, Taiwan
- Chih-Sung Liang: Department of Psychiatry, Tri-Service Hospital, Beitou Branch, Taipei, Taiwan; Department of Psychiatry, National Defense Medical Center, Taipei, Taiwan
- Kuan-Pin Su: College of Medicine, China Medical University, Taichung, Taiwan; Mind-Body Interface Laboratory, China Medical University and Hospital, Taichung, Taiwan; An-Nan Hospital, China Medical University, Tainan, Taiwan
23
Watari T, Takagi S, Sakaguchi K, Nishizaki Y, Shimizu T, Yamamoto Y, Tokuda Y. Performance Comparison of ChatGPT-4 and Japanese Medical Residents in the General Medicine In-Training Examination: Comparison Study. JMIR Med Educ 2023; 9:e52202. [PMID: 38055323 PMCID: PMC10733815 DOI: 10.2196/52202]
Abstract
BACKGROUND The reliability of GPT-4, a state-of-the-art expansive language model specializing in clinical reasoning and medical knowledge, remains largely unverified across non-English languages. OBJECTIVE This study aims to compare fundamental clinical competencies between Japanese residents and GPT-4 by using the General Medicine In-Training Examination (GM-ITE). METHODS We used the GPT-4 model provided by OpenAI and the GM-ITE examination questions for the years 2020, 2021, and 2022 to conduct a comparative analysis. This analysis focused on evaluating the performance of individuals who were concluding their second year of residency in comparison to that of GPT-4. Given the current abilities of GPT-4, our study included only single-choice exam questions, excluding those involving audio, video, or image data. The assessment included 4 categories: general theory (professionalism and medical interviewing), symptomatology and clinical reasoning, physical examinations and clinical procedures, and specific diseases. Additionally, we categorized the questions into 7 specialty fields and 3 levels of difficulty, which were determined based on residents' correct response rates. RESULTS Upon examination of 137 GM-ITE questions in Japanese, GPT-4 scores were significantly higher than the mean scores of residents (residents: 55.8%, GPT-4: 70.1%; P<.001). In terms of specific disciplines, GPT-4 scored 23.5 points higher in the "specific diseases," 30.9 points higher in "obstetrics and gynecology," and 26.1 points higher in "internal medicine." In contrast, GPT-4 scores in "medical interviewing and professionalism," "general practice," and "psychiatry" were lower than those of the residents, although this discrepancy was not statistically significant. Upon analyzing scores based on question difficulty, GPT-4 scores were 17.2 points lower for easy problems (P=.007) but were 25.4 and 24.4 points higher for normal and difficult problems, respectively (P<.001). 
In year-on-year comparisons, GPT-4 scores were 21.7 and 21.5 points higher in the 2020 (P=.01) and 2022 (P=.003) examinations, respectively, but only 3.5 points higher in the 2021 examinations (no significant difference). CONCLUSIONS In the Japanese language, GPT-4 also outperformed the average medical residents in the GM-ITE test, originally designed for them. Specifically, GPT-4 demonstrated a tendency to score higher on difficult questions with low resident correct response rates and those demanding a more comprehensive understanding of diseases. However, GPT-4 scored comparatively lower on questions that residents could readily answer, such as those testing attitudes toward patients and professionalism, as well as those necessitating an understanding of context and communication. These findings highlight the strengths and limitations of artificial intelligence applications in medical education and practice.
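The headline comparison (GPT-4 70.1% vs residents' mean 55.8%) is a difference of proportions. The abstract does not name the statistical test, and the residents' effective denominator in the actual study is far larger than 137, so the sketch below (equal toy sample sizes, a two-proportion z-test) is illustrative only and yields a weaker p-value than the published P<.001:

```python
import math

def two_prop_z(p1: float, n1: int, p2: float, n2: int):
    """Two-proportion z-test: returns z and the two-sided p-value
    (normal approximation with a pooled standard error)."""
    x1, x2 = p1 * n1, p2 * n2
    p_pool = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return z, p_value

# Toy: treat both GPT-4 and residents as answering the same 137 questions
z, p = two_prop_z(0.701, 137, 0.558, 137)
print(f"z = {z:.2f}, two-sided p = {p:.4f}")
```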
Affiliation(s)
- Takashi Watari: General Medicine Center, Shimane University Hospital, Izumo, Japan; Department of Medicine, University of Michigan Medical School, Ann Arbor, MI, United States; Medicine Service, VA Ann Arbor Healthcare System, Ann Arbor, MI, United States
- Soshi Takagi: Faculty of Medicine, Shimane University, Izumo, Japan
- Kota Sakaguchi: General Medicine Center, Shimane University Hospital, Izumo, Japan
- Yuji Nishizaki: Division of Medical Education, Juntendo University School of Medicine, Tokyo, Japan
- Taro Shimizu: Department of Diagnostic and Generalist Medicine, Dokkyo Medical University Hospital, Tochigi, Japan
- Yu Yamamoto: Division of General Medicine, Center for Community Medicine, Jichi Medical University, Tochigi, Japan
- Yasuharu Tokuda: Muribushi Okinawa Project for Teaching Hospitals, Okinawa, Japan
24
Semrl N, Feigl S, Taumberger N, Bracic T, Fluhr H, Blockeel C, Kollmann M. AI language models in human reproduction research: exploring ChatGPT's potential to assist academic writing. Hum Reprod 2023; 38:2281-2288. [PMID: 37833847 DOI: 10.1093/humrep/dead207]
Abstract
Artificial intelligence (AI)-driven language models have the potential to serve as an educational tool, facilitate clinical decision-making, and support research and academic writing. The benefits of their use are yet to be evaluated and concerns have been raised regarding the accuracy, transparency, and ethical implications of using this AI technology in academic publishing. At the moment, Chat Generative Pre-trained Transformer (ChatGPT) is one of the most powerful and widely debated AI language models. Here, we discuss its feasibility to answer scientific questions, identify relevant literature, and assist writing in the field of human reproduction. With consideration of the scarcity of data on this topic, we assessed the feasibility of ChatGPT in academic writing, using data from six meta-analyses published in a leading journal of human reproduction. The text generated by ChatGPT was evaluated and compared to the original text by blinded reviewers. While ChatGPT can produce high-quality text and summarize information efficiently, its current ability to interpret data and answer scientific questions is limited, and it cannot be relied upon for a literature search or accurate source citation due to the potential spread of incomplete or false information. We advocate for open discussions within the reproductive medicine research community to explore the advantages and disadvantages of implementing this AI technology. Researchers and reviewers should be informed about AI language models, and we encourage authors to transparently disclose their use.
Affiliation(s)
- N Semrl: Department of Obstetrics and Gynecology, Medical University of Graz, Graz, Austria
- S Feigl: Department of Obstetrics and Gynecology, Medical University of Graz, Graz, Austria
- N Taumberger: Department of Obstetrics and Gynecology, Medical University of Graz, Graz, Austria
- T Bracic: Department of Obstetrics and Gynecology, Medical University of Graz, Graz, Austria
- H Fluhr: Department of Obstetrics and Gynecology, Medical University of Graz, Graz, Austria
- C Blockeel: Centre for Reproductive Medicine, Universitair Ziekenhuis Brussel (UZ Brussel), Brussels, Belgium
- M Kollmann: Department of Obstetrics and Gynecology, Medical University of Graz, Graz, Austria
25
Savage T, Wang J, Shieh L. A Large Language Model Screening Tool to Target Patients for Best Practice Alerts: Development and Validation. JMIR Med Inform 2023; 11:e49886. [PMID: 38010803 DOI: 10.2196/49886]
Abstract
BACKGROUND Best Practice Alerts (BPAs) are alert messages to physicians in the electronic health record that are used to encourage appropriate use of health care resources. While these alerts are helpful in both improving care and reducing costs, BPAs are often broadly applied nonselectively across entire patient populations. The development of large language models (LLMs) provides an opportunity to selectively identify patients for BPAs. OBJECTIVE In this paper, we present an example case where an LLM screening tool is used to select patients appropriate for a BPA encouraging the prescription of deep vein thrombosis (DVT) anticoagulation prophylaxis. The artificial intelligence (AI) screening tool was developed to identify patients experiencing acute bleeding and exclude them from receiving a DVT prophylaxis BPA. METHODS Our AI screening tool used a BioMed-RoBERTa (Robustly Optimized Bidirectional Encoder Representations from Transformers Pretraining Approach; AllenAI) model to perform classification of physician notes, identifying patients without active bleeding and thus appropriate for a thromboembolism prophylaxis BPA. The BioMed-RoBERTa model was fine-tuned using 500 history and physical notes of patients from the MIMIC-III (Medical Information Mart for Intensive Care) database who were not prescribed anticoagulation. A development set of 300 MIMIC patient notes was used to determine the model's hyperparameters, and a separate test set of 300 patient notes was used to evaluate the screening tool. RESULTS Our MIMIC-III test set population of 300 patients included 72 patients with bleeding (ie, were not appropriate for a DVT prophylaxis BPA) and 228 without bleeding who were appropriate for a DVT prophylaxis BPA. The AI screening tool achieved impressive accuracy with a precision-recall area under the curve of 0.82 (95% CI 0.75-0.89) and a receiver operator curve area under the curve of 0.89 (95% CI 0.84-0.94). 
The screening tool reduced the number of patients who would trigger an alert by 20% (240 instead of 300 alerts) and increased alert applicability by 14.8% (218 [90.8%] positive alerts from 240 total alerts instead of 228 [76%] positive alerts from 300 total alerts), compared to nonselectively sending alerts for all patients. CONCLUSIONS These results show a proof of concept on how language models can be used as a screening tool for BPAs. We provide an example AI screening tool that uses a HIPAA (Health Insurance Portability and Accountability Act)-compliant BioMed-RoBERTa model deployed with minimal computing power. Larger models (eg, Generative Pre-trained Transformers-3, Generative Pre-trained Transformers-4, and Pathways Language Model) will exhibit superior performance but require data use agreements to be HIPAA compliant. We anticipate LLMs to revolutionize quality improvement in hospital medicine.
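The reported ROC area under the curve has a direct probabilistic reading: it is the probability that a randomly chosen positive case (here, a bleeding patient) receives a higher model score than a randomly chosen negative case. A dependency-free sketch with made-up scores (a real evaluation would typically use scikit-learn's metrics):

```python
def roc_auc(labels, scores):
    """ROC AUC via the Mann-Whitney statistic: the fraction of
    positive/negative pairs where the positive outranks the negative
    (ties count as half a win)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy data: 1 = actively bleeding (exclude from the BPA), 0 = no bleeding
labels = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.2, 0.1]  # hypothetical P(bleeding)
print(f"ROC AUC = {roc_auc(labels, scores):.2f}")
```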
Affiliation(s)
- Thomas Savage: Division of Hospital Medicine, Department of Medicine, Stanford University, Palo Alto, CA, United States
- John Wang: Division of Gastroenterology and Hepatology, Department of Medicine, Stanford University, Palo Alto, CA, United States
- Lisa Shieh: Division of Hospital Medicine, Department of Medicine, Stanford University, Palo Alto, CA, United States
26
Mansoor S, Baek M, Juergens D, Watson JL, Baker D. Zero-shot mutation effect prediction on protein stability and function using RoseTTAFold. Protein Sci 2023; 32:e4780. [PMID: 37695922 PMCID: PMC10578109 DOI: 10.1002/pro.4780]
Abstract
Predicting the effects of mutations on protein function and stability is an outstanding challenge. Here, we assess the performance of a variant of RoseTTAFold jointly trained for sequence and structure recovery, RFjoint , for mutation effect prediction. Without any further training, we achieve comparable accuracy in predicting mutation effects for a diverse set of protein families using RFjoint to both another zero-shot model (MSA Transformer) and a model that requires specific training on a particular protein family for mutation effect prediction (DeepSequence). Thus, although the architecture of RFjoint was developed to address the protein design problem of scaffolding functional motifs, RFjoint acquired an understanding of the mutational landscapes of proteins during model training that is equivalent to that of recently developed large protein language models. The ability to simultaneously reason over protein structure and sequence could enable even more precise mutation effect predictions following supervised training on the task. These results suggest that RFjoint has a quite broad understanding of protein sequence-structure landscapes, and can be viewed as a joint model for protein sequence and structure which could be broadly useful for protein modeling.
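Zero-shot mutation effect predictors of this kind are typically benchmarked by rank correlation between model scores and experimental measurements; the abstract does not name the metric, but Spearman's rho is the usual choice in this literature. A stdlib-only sketch with toy values:

```python
def rank(xs):
    """1-based average ranks, with ties sharing the mean rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman's rho: Pearson correlation of the ranks."""
    rx, ry = rank(xs), rank(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

predicted = [-1.2, -0.3, -2.5, 0.1]  # hypothetical model log-likelihood scores
measured = [0.4, 0.8, 0.1, 0.9]      # hypothetical experimental fitness
print(f"Spearman rho = {spearman(predicted, measured):.2f}")
```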
Affiliation(s)
- Sanaa Mansoor: Department of Biochemistry, University of Washington, Seattle, WA, USA; Institute for Protein Design, University of Washington, Seattle, WA, USA; Molecular Engineering Graduate Program, University of Washington, Seattle, WA, USA
- Minkyung Baek: Department of Biochemistry, University of Washington, Seattle, WA, USA; Institute for Protein Design, University of Washington, Seattle, WA, USA; School of Biological Sciences, Seoul National University, Seoul, Republic of Korea
- David Juergens: Department of Biochemistry, University of Washington, Seattle, WA, USA; Institute for Protein Design, University of Washington, Seattle, WA, USA; Molecular Engineering Graduate Program, University of Washington, Seattle, WA, USA
- Joseph L. Watson: Department of Biochemistry, University of Washington, Seattle, WA, USA; Institute for Protein Design, University of Washington, Seattle, WA, USA
- David Baker: Department of Biochemistry, University of Washington, Seattle, WA, USA; Institute for Protein Design, University of Washington, Seattle, WA, USA; Howard Hughes Medical Institute, University of Washington, Seattle, WA, USA
27
Benegas G, Batra SS, Song YS. DNA language models are powerful predictors of genome-wide variant effects. Proc Natl Acad Sci U S A 2023; 120:e2311219120. [PMID: 37883436 PMCID: PMC10622914 DOI: 10.1073/pnas.2311219120]
Abstract
The expanding catalog of genome-wide association studies (GWAS) provides biological insights across a variety of species, but identifying the causal variants behind these associations remains a significant challenge. Experimental validation is both labor-intensive and costly, highlighting the need for accurate, scalable computational methods to predict the effects of genetic variants across the entire genome. Inspired by recent progress in natural language processing, unsupervised pretraining on large protein sequence databases has proven successful in extracting complex information related to proteins. These models showcase their ability to learn variant effects in coding regions using an unsupervised approach. Expanding on this idea, we here introduce the Genomic Pre-trained Network (GPN), a model designed to learn genome-wide variant effects through unsupervised pretraining on genomic DNA sequences. Our model also successfully learns gene structure and DNA motifs without any supervision. To demonstrate its utility, we train GPN on unaligned reference genomes of Arabidopsis thaliana and seven related species within the Brassicales order and evaluate its ability to predict the functional impact of genetic variants in A. thaliana by utilizing allele frequencies from the 1001 Genomes Project and a comprehensive database of GWAS. Notably, GPN outperforms predictors based on popular conservation scores such as phyloP and phastCons. Our predictions for A. thaliana can be visualized as sequence logos in the UCSC Genome Browser (https://genome.ucsc.edu/s/gbenegas/gpn-arabidopsis). We provide code (https://github.com/songlab-cal/gpn) to train GPN for any given species using its DNA sequence alone, enabling unsupervised prediction of variant effects across the entire genome.
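A common way such genomic language models score a variant is the log-likelihood ratio between the alternate and reference alleles under the model's distribution at the masked position. The sketch below uses hypothetical probabilities; an actual run would obtain them from a trained GPN model via the linked repository:

```python
import math

def variant_effect_score(probs: dict, ref: str, alt: str) -> float:
    """Log-likelihood ratio log P(alt)/P(ref) at one masked position.
    Negative scores mean the model considers the alternate allele
    less likely, suggesting a deleterious variant."""
    return math.log(probs[alt] / probs[ref])

# Hypothetical masked-LM nucleotide distribution at one genomic position
probs = {"A": 0.70, "C": 0.05, "G": 0.20, "T": 0.05}
score = variant_effect_score(probs, ref="A", alt="C")
print(f"GPN-style score for A>C: {score:.2f}")
```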
Affiliation(s)
- Gonzalo Benegas: Graduate Group in Computational Biology, University of California, Berkeley, CA 94720
- Yun S. Song: Computer Science Division, University of California, Berkeley, CA 94720; Department of Statistics, University of California, Berkeley, CA 94720; Center for Computational Biology, University of California, Berkeley, CA 94720
Collapse
|
28
|
Zhou W, Prater LC, Goldstein EV, Mooney SJ. Identifying Rare Circumstances Preceding Female Firearm Suicides: Validating A Large Language Model Approach. JMIR Ment Health 2023; 10:e49359. [PMID: 37847549 PMCID: PMC10618876 DOI: 10.2196/49359]
Abstract
BACKGROUND Firearm suicide has been more prevalent among males, but age-adjusted female firearm suicide rates increased by 20% from 2010 to 2020, outpacing the rate increase among males by about 8 percentage points, and female firearm suicide may have different contributing circumstances. In the United States, the National Violent Death Reporting System (NVDRS) is a comprehensive source of data on violent deaths and includes unstructured incident narrative reports from coroners or medical examiners and law enforcement. Conventional natural language processing approaches have been used to identify common circumstances preceding female firearm suicide deaths but failed to identify rarer circumstances due to insufficient training data. OBJECTIVE This study aimed to leverage a large language model approach to identify infrequent circumstances preceding female firearm suicide in the unstructured coroners or medical examiners and law enforcement narrative reports available in the NVDRS. METHODS We used the narrative reports of 1462 female firearm suicide decedents in the NVDRS from 2014 to 2018. The reports were written in English. We coded 9 infrequent circumstances preceding female firearm suicides. We experimented with predicting those circumstances by leveraging a large language model approach in a yes/no question-answer format. We measured the prediction accuracy with F1-score (ranging from 0 to 1). F1-score is the harmonic mean of precision (positive predictive value) and recall (true positive rate or sensitivity). RESULTS Our large language model outperformed a conventional support vector machine-supervised machine learning approach by a wide margin. Compared to the support vector machine model, which had F1-scores less than 0.2 for most infrequent circumstances, our large language model approach achieved an F1-score of over 0.6 for 4 circumstances and 0.8 for 2 circumstances. CONCLUSIONS The use of a large language model approach shows promise. 
Researchers interested in using natural language processing to identify infrequent circumstances in narrative report data may benefit from large language models.
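The F1-score reported above is defined in the abstract as the harmonic mean of precision and recall; from confusion-matrix counts it can be computed directly:

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = harmonic mean of precision (tp/(tp+fp)) and recall (tp/(tp+fn))."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy counts for one rare circumstance label (not the study's data)
print(f"F1 = {f1_score(tp=8, fp=2, fn=2):.2f}")  # precision 0.8, recall 0.8
```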
Affiliation(s)
- Weipeng Zhou: Department of Biomedical Informatics and Medical Education, School of Medicine, University of Washington, Seattle, WA, United States
- Laura C Prater: Department of Psychiatry and Behavioral Health, University of Washington, Seattle, WA, United States; Harborview Medical Center, School of Medicine, University of Washington, Seattle, WA, United States
- Evan V Goldstein: Department of Population Health Sciences, University of Utah, Salt Lake City, UT, United States
- Stephen J Mooney: Department of Epidemiology, School of Public Health, University of Washington, Seattle, WA, United States
29
Korolev V, Protsenko P. Accurate, interpretable predictions of materials properties within transformer language models. Patterns (N Y) 2023; 4:100803. [PMID: 37876904 PMCID: PMC10591138 DOI: 10.1016/j.patter.2023.100803]
Abstract
Property prediction accuracy has long been a key parameter of machine learning in materials informatics. Accordingly, advanced models showing state-of-the-art performance turn into highly parameterized black boxes missing interpretability. Here, we present an elegant way to make their reasoning transparent. Human-readable text-based descriptions automatically generated within a suite of open-source tools are proposed as materials representation. Transformer language models pretrained on 2 million peer-reviewed articles take as input well-known terms such as chemical composition, crystal symmetry, and site geometry. Our approach outperforms crystal graph networks by classifying four out of five analyzed properties if one considers all available reference data. Moreover, fine-tuned text-based models show high accuracy in the ultra-small data limit. Explanations of their internal machinery are produced using local interpretability techniques and are faithful and consistent with domain expert rationales. This language-centric framework makes accurate property predictions accessible to people without artificial-intelligence expertise.
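The materials representation here is plain text: composition, crystal symmetry, and site geometry rendered as sentences a pretrained transformer can consume. The template below is an illustrative stand-in, not the paper's actual open-source description generator:

```python
def describe_material(formula: str, space_group: str, sites: dict) -> str:
    """Compose a human-readable description of a crystal for a
    text-based property predictor (hypothetical template)."""
    site_text = "; ".join(f"{el} occupies a {geom} site" for el, geom in sites.items())
    return f"{formula} crystallizes in the {space_group} space group. {site_text}."

text = describe_material("NaCl", "Fm-3m", {"Na": "octahedral", "Cl": "octahedral"})
print(text)
```

Such strings would then be tokenized and fed to the fine-tuned language model in place of a crystal-graph encoding.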
Affiliation(s)
- Vadim Korolev: Department of Chemistry, Lomonosov Moscow State University, 119991 Moscow, Russia
- Pavel Protsenko: Department of Chemistry, Lomonosov Moscow State University, 119991 Moscow, Russia
30
Beltrami EJ, Grant-Kels JM. Dermatology in the wake of an AI revolution: Who gets a say? J Am Acad Dermatol 2023; 89:e159-e160. [PMID: 37268021 DOI: 10.1016/j.jaad.2023.05.053]
Affiliation(s)
- Eric J Beltrami: University of Connecticut School of Medicine, Farmington, Connecticut
- Jane M Grant-Kels: Department of Dermatology, University of Connecticut Health Center, Farmington, Connecticut; Department of Dermatology, University of Florida, Gainesville, Florida
31
Ferreira AL, Lipoff JB. The complex ethics of applying ChatGPT and language model artificial intelligence in dermatology. J Am Acad Dermatol 2023; 89:e157-e158. [PMID: 37263382 DOI: 10.1016/j.jaad.2023.05.054]
Affiliation(s)
- Alana Luna Ferreira: Department of Dermatology, Perelman School of Medicine at the University of Pennsylvania, Philadelphia, Pennsylvania
- Jules B Lipoff: Department of Dermatology, Lewis Katz School of Medicine, Temple University, Philadelphia, Pennsylvania
32
Májovský M, Mikolov T, Netuka D. AI Is Changing the Landscape of Academic Writing: What Can Be Done? Authors' Reply to: AI Increases the Pressure to Overhaul the Scientific Peer Review Process. Comment on "Artificial Intelligence Can Generate Fraudulent but Authentic-Looking Scientific Medical Articles: Pandora's Box Has Been Opened". J Med Internet Res 2023; 25:e50844. [PMID: 37651175 PMCID: PMC10502592 DOI: 10.2196/50844]
Affiliation(s)
- Martin Májovský: Department of Neurosurgery and Neurooncology, First Faculty of Medicine, Charles University, Prague, Czech Republic
- Tomas Mikolov: Czech Institute of Informatics, Robotics and Cybernetics, Czech Technical University in Prague, Prague, Czech Republic
- David Netuka: Department of Neurosurgery and Neurooncology, First Faculty of Medicine, Charles University, Prague, Czech Republic
33
Liu N, Brown A. AI Increases the Pressure to Overhaul the Scientific Peer Review Process. Comment on "Artificial Intelligence Can Generate Fraudulent but Authentic-Looking Scientific Medical Articles: Pandora's Box Has Been Opened". J Med Internet Res 2023; 25:e50591. [PMID: 37651167 PMCID: PMC10502600 DOI: 10.2196/50591]
Affiliation(s)
- Nicholas Liu: John A Burns School of Medicine, University of Hawai'i at Mānoa, Honolulu, HI, United States
- Amy Brown: Department of Quantitative Health Sciences, John A Burns School of Medicine, University of Hawai'i at Mānoa, Honolulu, HI, United States
34
Hsu HY, Hsu KC, Hou SY, Wu CL, Hsieh YW, Cheng YD. Examining Real-World Medication Consultations and Drug-Herb Interactions: ChatGPT Performance Evaluation. JMIR Med Educ 2023; 9:e48433. [PMID: 37561097 PMCID: PMC10477918 DOI: 10.2196/48433]
Abstract
BACKGROUND Since OpenAI released ChatGPT, with its strong capability in handling natural tasks and its user-friendly interface, it has garnered significant attention. OBJECTIVE A prospective analysis is required to evaluate the accuracy and appropriateness of medication consultation responses generated by ChatGPT. METHODS A prospective cross-sectional study was conducted by the pharmacy department of a medical center in Taiwan. The test data set comprised retrospective medication consultation questions collected from February 1, 2023, to February 28, 2023, along with common questions about drug-herb interactions. Two distinct sets of questions were tested: real-world medication consultation questions and common questions about interactions between traditional Chinese and Western medicines. We used the conventional double-review mechanism. The appropriateness of each response from ChatGPT was assessed by 2 experienced pharmacists. In the event of a discrepancy between the assessments, a third pharmacist stepped in to make the final decision. RESULTS Of 293 real-world medication consultation questions, a random selection of 80 was used to evaluate ChatGPT's performance. ChatGPT exhibited a higher appropriateness rate in responding to public medication consultation questions compared to those asked by health care providers in a hospital setting (31/51, 61% vs 20/51, 39%; P=.01). CONCLUSIONS The findings from this study suggest that ChatGPT could potentially be used for answering basic medication consultation questions. Our analysis of the erroneous information allowed us to identify potential medical risks associated with certain questions; this problem deserves our close attention.
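The "conventional double-review mechanism" described above reduces to a simple decision rule: accept the two primary pharmacists' rating when they agree, otherwise defer to the third. A minimal sketch (boolean appropriateness ratings are an assumption for illustration):

```python
def adjudicate(rater1: bool, rater2: bool, tiebreaker) -> bool:
    """Double review: when the two primary raters agree, take their
    verdict; on disagreement, consult a third reviewer."""
    if rater1 == rater2:
        return rater1
    return tiebreaker()

# The third pharmacist is consulted only when the first two disagree
verdict_agree = adjudicate(True, True, tiebreaker=lambda: False)
verdict_split = adjudicate(True, False, tiebreaker=lambda: False)
print(verdict_agree, verdict_split)
```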
Collapse
Affiliation(s)
- Hsing-Yu Hsu
- Department of Pharmacy, China Medical University Hospital, Taichung, Taiwan
- Graduate Institute of Clinical Pharmacy, College of Medicine, National Taiwan University, Taipei, Taiwan
- Kai-Cheng Hsu
- Artificial Intelligence Center, China Medical University Hospital, Taichung, Taiwan
- Department of Medicine, China Medical University, Taichung, Taiwan
- Shih-Yen Hou
- Artificial Intelligence Center, China Medical University Hospital, Taichung, Taiwan
- Ching-Lung Wu
- School of Pharmacy, College of Pharmacy, China Medical University, Taichung, Taiwan
- Yow-Wen Hsieh
- Department of Pharmacy, China Medical University Hospital, Taichung, Taiwan
- School of Pharmacy, College of Pharmacy, China Medical University, Taichung, Taiwan
- Yih-Dih Cheng
- Department of Pharmacy, China Medical University Hospital, Taichung, Taiwan
- School of Pharmacy, College of Pharmacy, China Medical University, Taichung, Taiwan
35
Borchert RJ, Hickman CR, Pepys J, Sadler TJ. Performance of ChatGPT on the Situational Judgement Test-A Professional Dilemmas-Based Examination for Doctors in the United Kingdom. JMIR Med Educ 2023; 9:e48978. [PMID: 37548997] [PMCID: PMC10442724] [DOI: 10.2196/48978]
Abstract
BACKGROUND ChatGPT is a large language model that has performed well on professional examinations in the fields of medicine, law, and business. However, it is unclear how ChatGPT would perform on an examination assessing professionalism and situational judgement for doctors. OBJECTIVE We evaluated the performance of ChatGPT on the Situational Judgement Test (SJT): a national examination taken by all final-year medical students in the United Kingdom. This examination is designed to assess attributes such as communication, teamwork, patient safety, prioritization skills, professionalism, and ethics. METHODS All questions from the UK Foundation Programme Office's (UKFPO's) 2023 SJT practice examination were inputted into ChatGPT. For each question, ChatGPT's answers and rationales were recorded and assessed on the basis of the official UK Foundation Programme Office scoring template. Questions were categorized into domains of Good Medical Practice on the basis of the domains referenced in the rationales provided in the scoring sheet. Questions without clear domain links were screened by reviewers and assigned one or multiple domains. ChatGPT's overall performance, as well as its performance across the domains of Good Medical Practice, was evaluated. RESULTS Overall, ChatGPT performed well, scoring 76% on the SJT but scoring full marks on only a few questions (9%), which may reflect possible flaws in ChatGPT's situational judgement or inconsistencies in the reasoning across questions (or both) in the examination itself. ChatGPT demonstrated consistent performance across the 4 outlined domains in Good Medical Practice for doctors. CONCLUSIONS Further research is needed to understand the potential applications of large language models, such as ChatGPT, in medical education for standardizing questions and providing consistent rationales for examinations assessing professionalism and ethics.
Affiliation(s)
- Robin J Borchert
- Department of Radiology, University of Cambridge, Cambridge, United Kingdom
- Department of Radiology, Addenbrooke's Hospital, Cambridge University Hospitals NHS Foundation Trust, Cambridge, United Kingdom
- Charlotte R Hickman
- Department of General Medicine, Lister Hospital, East and North Hertfordshire NHS Trust, Stevenage, United Kingdom
- Jack Pepys
- Department of Biomedical Sciences, Humanitas University, Milan, Italy
- Timothy J Sadler
- Department of Radiology, Addenbrooke's Hospital, Cambridge University Hospitals NHS Foundation Trust, Cambridge, United Kingdom
36
Shaitarova A, Zaghir J, Lavelli A, Krauthammer M, Rinaldi F. Exploring the Latest Highlights in Medical Natural Language Processing across Multiple Languages: A Survey. Yearb Med Inform 2023; 32:230-243. [PMID: 38147865] [PMCID: PMC10751112] [DOI: 10.1055/s-0043-1768726]
Abstract
OBJECTIVES This survey aims to provide an overview of the current state of biomedical and clinical Natural Language Processing (NLP) research and practice in Languages other than English (LoE). We pay special attention to data resources, language models, and popular NLP downstream tasks. METHODS We explore the literature on clinical and biomedical NLP from the years 2020-2022, focusing on the challenges of multilinguality and LoE. We query online databases and manually select relevant publications. We also use recent NLP review papers to identify the possible information lacunae. RESULTS Our work confirms the recent trend towards the use of transformer-based language models for a variety of NLP tasks in medical domains. In addition, there has been an increase in the availability of annotated datasets for clinical NLP in LoE, particularly in European languages such as Spanish, German and French. Common NLP tasks addressed in medical NLP research in LoE include information extraction, named entity recognition, normalization, linking, and negation detection. However, there is still a need for the development of annotated datasets and models specifically tailored to the unique characteristics and challenges of medical text in some of these languages, especially low-resource ones. Lastly, this survey highlights the progress of medical NLP in LoE, and helps identify opportunities for future research and development in this field.
Affiliation(s)
- Jamil Zaghir
- Division of Medical Information Sciences, University Hospitals of Geneva, Geneva, Switzerland
- Department of Radiology and Medical Informatics, University of Geneva, Geneva, Switzerland
- Alberto Lavelli
- Natural Language Processing Research Unit, Center for Digital Health and Wellbeing, Fondazione Bruno Kessler, Trento, Italy
- Michael Krauthammer
- Department of Quantitative Biomedicine, University of Zurich, Zurich, Switzerland
- Biomedical Informatics, University Hospital Zurich, Zurich, Switzerland
- Fabio Rinaldi
- Natural Language Processing Research Unit, Center for Digital Health and Wellbeing, Fondazione Bruno Kessler, Trento, Italy
- Department of Quantitative Biomedicine, University of Zurich, Zurich, Switzerland
- Dalle Molle Institute for Artificial Intelligence Research, Lugano, Switzerland
- Swiss Institute of Bioinformatics
37
Baker MN, Burruss CP, Wilson CL. ChatGPT: A Supplemental Tool for Efficiency and Improved Communication in Rural Dermatology. Cureus 2023; 15:e43812. [PMID: 37731429] [PMCID: PMC10508964] [DOI: 10.7759/cureus.43812]
Abstract
In November 2022, OpenAI released version 3.5 of ChatGPT, the first publicly available artificial intelligence (AI) language model designed to engage in natural, human-like dialogue with users. While this groundbreaking technology has been extensively studied in various domains, its potential applications in rural dermatology remain unexplored in the existing literature. Our research investigates the many benefits that ChatGPT could offer in rural dermatology, particularly concerning administrative tasks and communication with communities with lower healthcare literacy. However, we also acknowledge that utilizing ChatGPT without proper caution and discernment may lead to potential drawbacks. This paper examines the opportunities and challenges associated with integrating ChatGPT into rural dermatology practices, ultimately fostering a well-informed and responsible approach to its implementation.
Affiliation(s)
- Mindy N Baker
- Dermatology, University of Kentucky College of Medicine, Lexington, USA
- Clayton P Burruss
- Dermatology, University of Kentucky College of Medicine, Lexington, USA
38
Gala D, Makaryus AN. The Utility of Language Models in Cardiology: A Narrative Review of the Benefits and Concerns of ChatGPT-4. Int J Environ Res Public Health 2023; 20:6438. [PMID: 37568980] [PMCID: PMC10419098] [DOI: 10.3390/ijerph20156438]
Abstract
Artificial intelligence (AI) and language models such as ChatGPT-4 (Generative Pretrained Transformer) have made tremendous advances recently and are rapidly transforming the landscape of medicine. Cardiology is among many of the specialties that utilize AI with the intention of improving patient care. Generative AI, with the use of its advanced machine learning algorithms, has the potential to diagnose heart disease and recommend management options suitable for the patient. This may lead to improved patient outcomes not only by recommending the best treatment plan but also by increasing physician efficiency. Language models could assist physicians with administrative tasks, allowing them to spend more time on patient care. However, there are several concerns with the use of AI and language models in the field of medicine. These technologies may not be the most up-to-date with the latest research and could provide outdated information, which may lead to an adverse event. Secondly, AI tools can be expensive, leading to increased healthcare costs and reduced accessibility to the general population. There is also concern about the loss of the human touch and empathy as AI becomes more mainstream. Healthcare professionals would need to be adequately trained to utilize these tools. While AI and language models have many beneficial traits, all healthcare providers need to be involved and aware of generative AI so as to assure its optimal use and mitigate any potential risks and challenges associated with its implementation. In this review, we discuss the various uses of language models in the field of cardiology.
Affiliation(s)
- Dhir Gala
- Department of Clinical Science, American University of the Caribbean School of Medicine, Cupecoy, Sint Maarten, The Netherlands
- Amgad N. Makaryus
- Donald and Barbara Zucker School of Medicine at Hofstra/Northwell, Hofstra University, 500 Hofstra Blvd., Hempstead, NY 11549, USA
- Department of Cardiology, Nassau University Medical Center, Hempstead, NY 11554, USA
39
Abstract
Large language models (LLMs) are one of the most impressive achievements of artificial intelligence in recent years. However, their relevance to the study of language more broadly remains unclear. This article considers the potential of LLMs to serve as models of language understanding in humans. While debate on this question typically centres around models' performance on challenging language understanding tasks, this article argues that the answer depends on models' underlying competence, and thus that the focus of the debate should be on empirical work which seeks to characterize the representations and processing algorithms that underlie model behaviour. From this perspective, the article offers counterarguments to two commonly cited reasons why LLMs cannot serve as plausible models of language in humans: their lack of symbolic structure and their lack of grounding. For each, a case is made that recent empirical trends undermine the common assumptions about LLMs, and thus that it is premature to draw conclusions about LLMs' ability (or lack thereof) to offer insights on human language representation and understanding. This article is part of a discussion meeting issue 'Cognitive artificial intelligence'.
Affiliation(s)
- Ellie Pavlick
- Department of Computer Science, Brown University, Providence, RI, USA
40
Beaulieu-Jones BR, Shah S, Berrigan MT, Marwaha JS, Lai SL, Brat GA. Evaluating Capabilities of Large Language Models: Performance of GPT4 on Surgical Knowledge Assessments. medRxiv 2023:2023.07.16.23292743. [PMID: 37502981] [PMCID: PMC10371188] [DOI: 10.1101/2023.07.16.23292743]
Abstract
Background Artificial intelligence (AI) has the potential to dramatically alter healthcare by enhancing how we diagnose and treat disease. One promising AI model is ChatGPT, a large general-purpose language model trained by OpenAI. The chat interface has shown robust, human-level performance on several professional and academic benchmarks. We sought to probe its performance and stability over time on surgical case questions. Methods We evaluated the performance of ChatGPT-4 on two surgical knowledge assessments: the Surgical Council on Resident Education (SCORE) and a second commonly used knowledge assessment, referred to as Data-B. Questions were entered in two formats: open-ended and multiple choice. ChatGPT outputs were assessed for accuracy and insights by surgeon evaluators. We categorized reasons for model errors and the stability of performance on repeat encounters. Results A total of 167 SCORE and 112 Data-B questions were presented to the ChatGPT interface. ChatGPT correctly answered 71% and 68% of multiple-choice SCORE and Data-B questions, respectively. For both open-ended and multiple-choice questions, approximately two-thirds of ChatGPT responses contained non-obvious insights. Common reasons for inaccurate responses included: inaccurate information in a complex question (n=16, 36.4%); inaccurate information in a fact-based question (n=11, 25.0%); and accurate information with circumstantial discrepancy (n=6, 13.6%). Upon repeat query, the answer selected by ChatGPT varied for 36.4% of inaccurate questions; the response accuracy changed for 6/16 questions. Conclusion Consistent with prior findings, we demonstrate robust near or above human-level performance of ChatGPT within the surgical domain. Unique to this study, we demonstrate a substantial inconsistency in ChatGPT responses with repeat query. This finding warrants future consideration and presents an opportunity to further train these models to provide safe and consistent responses. Without mental and/or conceptual models, it is unclear whether language models such as ChatGPT would be able to safely assist clinicians in providing care.
Affiliation(s)
- Brendin R Beaulieu-Jones
- Department of Surgery, Beth Israel Deaconess Medical Center, Boston, MA
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA
- Sahaj Shah
- Geisinger Commonwealth School of Medicine, Scranton, PA
- Jayson S Marwaha
- Division of Colorectal Surgery, National Taiwan University Hospital, Taipei, Taiwan
- Shuo-Lun Lai
- Division of Colorectal Surgery, National Taiwan University Hospital, Taipei, Taiwan
- Gabriel A Brat
- Department of Surgery, Beth Israel Deaconess Medical Center, Boston, MA
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA
41
Mottin L, Goldman JP, Jäggli C, Achermann R, Gobeill J, Knafou J, Ehrsam J, Wicky A, Gérard CL, Schwenk T, Charrier M, Tsantoulis P, Lovis C, Leichtle A, Kiessling MK, Michielin O, Pradervand S, Foufi V, Ruch P. Multilingual RECIST classification of radiology reports using supervised learning. Front Digit Health 2023; 5:1195017. [PMID: 37388252] [PMCID: PMC10303934] [DOI: 10.3389/fdgth.2023.1195017]
Abstract
Objectives The objective of this study is the exploration of Artificial Intelligence and Natural Language Processing techniques to support the automatic assignment of the four Response Evaluation Criteria in Solid Tumors (RECIST) scales based on radiology reports. We also aim at evaluating how languages and institutional specificities of Swiss teaching hospitals are likely to affect the quality of the classification in French and German languages. Methods In our approach, 7 machine learning methods were evaluated to establish a strong baseline. Then, robust models were built, fine-tuned according to the language (French and German), and compared with the expert annotation. Results The best strategies yield average F1-scores of 90% and 86% respectively for the 2-class (Progressive/Non-progressive) and the 4-class (Progressive Disease, Stable Disease, Partial Response, Complete Response) RECIST classification tasks. Conclusions These results are competitive with the manual labeling as measured by Matthews correlation coefficient and Cohen's Kappa (79% and 76%). On this basis, we confirm the capacity of specific models to generalize on new unseen data and we assess the impact of using Pre-trained Language Models (PLMs) on the accuracy of the classifiers.
Affiliation(s)
- Luc Mottin
- HES-SO\HEG Genève, Information Sciences, Geneva, Switzerland
- SIB Text Mining Group, Swiss Institute of Bioinformatics, Geneva, Switzerland
- Jean-Philippe Goldman
- Division of Medical Information Sciences, University Hospitals of Geneva, Geneva, Switzerland
- Christoph Jäggli
- Inselspital – Bern University Hospital and University of Bern, Bern, Switzerland
- Rita Achermann
- Department of Radiology, Clinic of Radiology & Nuclear Medicine, University Hospital Basel, University of Basel, Basel, Switzerland
- Julien Gobeill
- HES-SO\HEG Genève, Information Sciences, Geneva, Switzerland
- SIB Text Mining Group, Swiss Institute of Bioinformatics, Geneva, Switzerland
- Julien Knafou
- HES-SO\HEG Genève, Information Sciences, Geneva, Switzerland
- SIB Text Mining Group, Swiss Institute of Bioinformatics, Geneva, Switzerland
- Julien Ehrsam
- Department of Radiology and Medical Informatics, University of Geneva, Geneva, Switzerland
- Alexandre Wicky
- Precision Oncology Center, Oncology Department, Centre Hospitalier Universitaire Vaudois – CHUV, Lausanne, Switzerland
- Camille L. Gérard
- Precision Oncology Center, Oncology Department, Centre Hospitalier Universitaire Vaudois – CHUV, Lausanne, Switzerland
- Tanja Schwenk
- Department of Oncology, Kantonsspital Aarau, Aarau, Switzerland
- Mélinda Charrier
- Division of Medical Information Sciences, University Hospitals of Geneva, Geneva, Switzerland
- Petros Tsantoulis
- Division of Medical Information Sciences, University Hospitals of Geneva, Geneva, Switzerland
- Department of Radiology and Medical Informatics, University of Geneva, Geneva, Switzerland
- Christian Lovis
- Division of Medical Information Sciences, University Hospitals of Geneva, Geneva, Switzerland
- Department of Radiology and Medical Informatics, University of Geneva, Geneva, Switzerland
- Alexander Leichtle
- Inselspital – Bern University Hospital and University of Bern, Bern, Switzerland
- Michael K. Kiessling
- Department of Medical Oncology and Hematology, University Hospital Zurich, Zurich, Switzerland
- Olivier Michielin
- Precision Oncology Center, Oncology Department, Centre Hospitalier Universitaire Vaudois – CHUV, Lausanne, Switzerland
- Sylvain Pradervand
- Precision Oncology Center, Oncology Department, Centre Hospitalier Universitaire Vaudois – CHUV, Lausanne, Switzerland
- Vasiliki Foufi
- Division of Medical Information Sciences, University Hospitals of Geneva, Geneva, Switzerland
- Patrick Ruch
- HES-SO\HEG Genève, Information Sciences, Geneva, Switzerland
- SIB Text Mining Group, Swiss Institute of Bioinformatics, Geneva, Switzerland
42
Májovský M, Černý M, Kasal M, Komarc M, Netuka D. Artificial Intelligence Can Generate Fraudulent but Authentic-Looking Scientific Medical Articles: Pandora's Box Has Been Opened. J Med Internet Res 2023; 25:e46924. [PMID: 37256685] [PMCID: PMC10267787] [DOI: 10.2196/46924]
Abstract
BACKGROUND Artificial intelligence (AI) has advanced substantially in recent years, transforming many industries and improving the way people live and work. In scientific research, AI can enhance the quality and efficiency of data analysis and publication. However, AI has also opened up the possibility of generating high-quality fraudulent papers that are difficult to detect, raising important questions about the integrity of scientific research and the trustworthiness of published papers. OBJECTIVE The aim of this study was to investigate the capabilities of current AI language models in generating high-quality fraudulent medical articles. We hypothesized that modern AI models can create highly convincing fraudulent papers that can easily deceive readers and even experienced researchers. METHODS This proof-of-concept study used ChatGPT (Chat Generative Pre-trained Transformer) powered by the GPT-3 (Generative Pre-trained Transformer 3) language model to generate a fraudulent scientific article related to neurosurgery. GPT-3 is a large language model developed by OpenAI that uses deep learning algorithms to generate human-like text in response to prompts given by users. The model was trained on a massive corpus of text from the internet and is capable of generating high-quality text in a variety of languages and on various topics. The authors posed questions and prompts to the model and refined them iteratively as the model generated the responses. The goal was to create a completely fabricated article including the abstract, introduction, material and methods, discussion, references, charts, etc. Once the article was generated, it was reviewed for accuracy and coherence by experts in the fields of neurosurgery, psychiatry, and statistics and compared to existing similar articles. 
RESULTS The study found that the AI language model can create a highly convincing fraudulent article that resembled a genuine scientific paper in terms of word usage, sentence structure, and overall composition. The AI-generated article included standard sections such as introduction, material and methods, results, and discussion, as well as a data sheet. It consisted of 1992 words and 17 citations, and the whole process of article creation took approximately 1 hour without any special training of the human user. However, there were some concerns and specific mistakes identified in the generated article, specifically in the references. CONCLUSIONS The study demonstrates the potential of current AI language models to generate completely fabricated scientific articles. Although the papers look sophisticated and seemingly flawless, expert readers may identify semantic inaccuracies and errors upon closer inspection. We highlight the need for increased vigilance and better detection methods to combat the potential misuse of AI in scientific research. At the same time, it is important to recognize the potential benefits of using AI language models in genuine scientific writing and research, such as manuscript preparation and language editing.
Affiliation(s)
- Martin Májovský
- Department of Neurosurgery and Neurooncology, First Faculty of Medicine, Charles University, Prague, Czech Republic
- Martin Černý
- Department of Neurosurgery and Neurooncology, First Faculty of Medicine, Charles University, Prague, Czech Republic
- Matěj Kasal
- Department of Psychiatry, Faculty of Medicine in Pilsen, Charles University, Pilsen, Czech Republic
- Martin Komarc
- Institute of Biophysics and Informatics, First Faculty of Medicine, Charles University, Prague, Czech Republic
- Department of Methodology, Faculty of Physical Education and Sport, Charles University, Prague, Czech Republic
- David Netuka
- Department of Neurosurgery and Neurooncology, First Faculty of Medicine, Charles University, Prague, Czech Republic
43
Bommasani R, Liang P, Lee T. Holistic Evaluation of Language Models. Ann N Y Acad Sci 2023. [PMID: 37230490] [DOI: 10.1111/nyas.15007]
Abstract
Language models (LMs) like GPT-3, PaLM, and ChatGPT are the foundation for almost all major language technologies, but their capabilities, limitations, and risks are not well understood. We present Holistic Evaluation of Language Models (HELM) to improve the transparency of LMs. LMs can serve many purposes and their behavior should satisfy many desiderata. To navigate the vast space of potential scenarios and metrics, we taxonomize the space and select representative subsets. We evaluate models on 16 core scenarios and 7 metrics, exposing important trade-offs. We supplement our core evaluation with seven targeted evaluations to deeply analyze specific aspects (including world knowledge, reasoning, regurgitation of copyrighted content, and generation of disinformation). We benchmark 30 LMs, from OpenAI, Microsoft, Google, Meta, Cohere, AI21 Labs, and others. Prior to HELM, models were evaluated on just 17.9% of the core HELM scenarios, with some prominent models not sharing a single scenario in common. We improve this to 96.0%: all 30 models are now benchmarked under the same standardized conditions. Our evaluation surfaces 25 top-level findings. For full transparency, we release all raw model prompts and completions publicly. HELM is a living benchmark for the community, continuously updated with new scenarios, metrics, and models https://crfm.stanford.edu/helm/latest/.
Affiliation(s)
- Rishi Bommasani
- Center for Research on Foundation Models, Stanford University, Stanford, California, USA
- Percy Liang
- Center for Research on Foundation Models, Stanford University, Stanford, California, USA
- Tony Lee
- Center for Research on Foundation Models, Stanford University, Stanford, California, USA
44
Richter-Pechanski P, Wiesenbach P, Schwab DM, Kiriakou C, He M, Geis NA, Frank A, Dieterich C. Few-Shot and Prompt Training for Text Classification in German Doctor's Letters. Stud Health Technol Inform 2023; 302:819-820. [PMID: 37203504] [DOI: 10.3233/shti230275]
Abstract
To classify sentences in cardiovascular German doctor's letters into eleven section categories, we used pattern-exploiting training, a prompt-based method for text classification in few-shot learning scenarios (20, 50 and 100 instances per class) using language models with various pre-training approaches evaluated on CARDIO:DE, a freely available German clinical routine corpus. Prompting improves results by 5-28% accuracy compared to traditional methods, reducing manual annotation efforts and computational costs in a clinical setting.
Affiliation(s)
- Phillip Richter-Pechanski
- Klaus Tschira Institute for Computational Cardiology, Heidelberg, Germany
- Department of Internal Medicine III, University Hospital Heidelberg, Germany
- German Center for Cardiovascular Research (DZHK) - Partner site Heidelberg/Mannheim, Germany
- Informatics for Life, Heidelberg, Germany
- Department of Computational Linguistics, Heidelberg University, Germany
- Philipp Wiesenbach
- Klaus Tschira Institute for Computational Cardiology, Heidelberg, Germany
- Department of Internal Medicine III, University Hospital Heidelberg, Germany
- Informatics for Life, Heidelberg, Germany
- Dominic M Schwab
- Department of Internal Medicine III, University Hospital Heidelberg, Germany
- Christina Kiriakou
- Department of Internal Medicine III, University Hospital Heidelberg, Germany
- Mingyang He
- Klaus Tschira Institute for Computational Cardiology, Heidelberg, Germany
- Department of Internal Medicine III, University Hospital Heidelberg, Germany
- Department of Computational Linguistics, Heidelberg University, Germany
- Nicolas A Geis
- Department of Internal Medicine III, University Hospital Heidelberg, Germany
- Informatics for Life, Heidelberg, Germany
- Anette Frank
- Department of Computational Linguistics, Heidelberg University, Germany
- Christoph Dieterich
- Klaus Tschira Institute for Computational Cardiology, Heidelberg, Germany
- Department of Internal Medicine III, University Hospital Heidelberg, Germany
- German Center for Cardiovascular Research (DZHK) - Partner site Heidelberg/Mannheim, Germany
- Informatics for Life, Heidelberg, Germany
45
Dillion D, Tandon N, Gu Y, Gray K. Can AI language models replace human participants? Trends Cogn Sci 2023:S1364-6613(23)00098-0. [PMID: 37173156] [DOI: 10.1016/j.tics.2023.04.008]
Abstract
Recent work suggests that language models such as GPT can make human-like judgments across a number of domains. We explore whether and when language models might replace human participants in psychological science. We review nascent research, provide a theoretical model, and outline caveats of using AI as a participant.
Affiliation(s)
- Danica Dillion
- University of North Carolina, Department of Psychology and Neuroscience, Chapel Hill, NC 27599-3270, USA
- Yuling Gu
- Allen Institute for AI, Seattle, WA 98103, USA
- Kurt Gray
- University of North Carolina, Department of Psychology and Neuroscience, Chapel Hill, NC 27599-3270, USA
46
Kauf C, Tuckute G, Levy R, Andreas J, Fedorenko E. Lexical semantic content, not syntactic structure, is the main contributor to ANN-brain similarity of fMRI responses in the language network. bioRxiv 2023:2023.05.05.539646. [PMID: 37205405] [PMCID: PMC10187317] [DOI: 10.1101/2023.05.05.539646]
Abstract
Representations from artificial neural network (ANN) language models have been shown to predict human brain activity in the language network. To understand what aspects of linguistic stimuli contribute to ANN-to-brain similarity, we used an fMRI dataset of responses to n=627 naturalistic English sentences (Pereira et al., 2018) and systematically manipulated the stimuli for which ANN representations were extracted. In particular, we i) perturbed sentences' word order, ii) removed different subsets of words, or iii) replaced sentences with other sentences of varying semantic similarity. We found that the lexical semantic content of the sentence (largely carried by content words) rather than the sentence's syntactic form (conveyed via word order or function words) is primarily responsible for the ANN-to-brain similarity. In follow-up analyses, we found that perturbation manipulations that adversely affect brain predictivity also lead to more divergent representations in the ANN's embedding space and decrease the ANN's ability to predict upcoming tokens in those stimuli. Further, results are robust to whether the mapping model is trained on intact or perturbed stimuli, and whether the ANN sentence representations are conditioned on the same linguistic context that humans saw. The critical result-that lexical-semantic content is the main contributor to the similarity between ANN representations and neural ones-aligns with the idea that the goal of the human language system is to extract meaning from linguistic strings. Finally, this work highlights the strength of systematic experimental manipulations for evaluating how close we are to accurate and generalizable models of the human language network.
Affiliation(s)
- Carina Kauf: Department of Brain and Cognitive Sciences and McGovern Institute for Brain Research, Massachusetts Institute of Technology
- Greta Tuckute: Department of Brain and Cognitive Sciences and McGovern Institute for Brain Research, Massachusetts Institute of Technology
- Roger Levy: Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology
- Jacob Andreas: Computer Science & Artificial Intelligence Laboratory, Massachusetts Institute of Technology
- Evelina Fedorenko: Department of Brain and Cognitive Sciences and McGovern Institute for Brain Research, Massachusetts Institute of Technology; Program in Speech and Hearing Bioscience and Technology, Harvard University
47
Owen D, Antypas D, Hassoulas A, Pardiñas AF, Espinosa-Anke L, Camacho Collados J. Enabling Early Health Care Intervention by Detecting Depression in Users of Web-Based Forums Using Language Models: Longitudinal Analysis and Evaluation. JMIR AI 2023; 2:e41205. PMID: 37525646; PMCID: PMC7614849; DOI: 10.2196/41205.
Abstract
Background: Major depressive disorder is a common mental disorder affecting 5% of adults worldwide. Early contact with health care services is critical for achieving accurate diagnosis and improving patient outcomes. Key symptoms of major depressive disorder (depression hereafter), such as cognitive distortions, are observed in verbal communication and can also manifest in the structure of written language. The automatic analysis of text may therefore provide opportunities for early intervention in settings where written communication is rich and regular, such as social media and web-based forums.
Objective: The objective of this study was 2-fold. We sought to gauge the effectiveness of different machine learning approaches in identifying users of the mass web-based forum Reddit who eventually disclose a diagnosis of depression. We then aimed to determine whether the time between a forum post and a depression diagnosis date was a relevant factor in this detection.
Methods: A total of 2 Reddit data sets containing posts belonging to users with and without a history of depression diagnosis were obtained. The intersection of these data sets provided users with an estimated date of depression diagnosis. This derived data set was used as input for several machine learning classifiers, including transformer-based language models (LMs).
Results: The Bidirectional Encoder Representations from Transformers (BERT) and MentalBERT transformer-based LMs proved the most effective in distinguishing forum users with a known depression diagnosis from those without, each obtaining a mean F1-score of 0.64 across the binary classification setups. The results also suggested that the final 12 to 16 weeks (about 3-4 months) of posts before a user's estimated diagnosis date are the most indicative of their illness; data before that period did not help the models detect more accurately. Furthermore, in the 4- to 8-week period before the user's estimated diagnosis date, posts exhibited more negative sentiment than in any other 4-week period in the user's post history.
Conclusions: Transformer-based LMs may be used on data from web-based social media forums to identify users at risk for psychiatric conditions such as depression. Language features picked up by these classifiers might predate depression onset by weeks to months, enabling proactive mental health care interventions to support those at risk.
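The study's finding that only the final 12 to 16 weeks of posts before the estimated diagnosis date are informative implies a time-windowing step before classification. A minimal sketch of that windowing, assuming a hypothetical post record with `date` and `text` fields (the study's actual data schema is not given here):

```python
from datetime import date, timedelta

def posts_in_window(posts, diagnosis_date, weeks_before=16):
    """Keep posts from the final `weeks_before` weeks prior to the
    estimated diagnosis date; earlier posts did not improve detection."""
    start = diagnosis_date - timedelta(weeks=weeks_before)
    return [p for p in posts if start <= p["date"] < diagnosis_date]

# Toy records standing in for a user's Reddit post history.
posts = [
    {"date": date(2022, 1, 1), "text": "old post"},
    {"date": date(2022, 11, 20), "text": "recent post"},
]
recent = posts_in_window(posts, diagnosis_date=date(2022, 12, 1))
print([p["text"] for p in recent])  # only the post within 16 weeks survives
```

The filtered posts would then be fed to a classifier such as BERT or MentalBERT; the gating by date is what lets the model focus on the period the study found most indicative.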
Affiliation(s)
- David Owen: School of Computer Science and Informatics, Cardiff University, Cardiff, United Kingdom
- Dimosthenis Antypas: School of Computer Science and Informatics, Cardiff University, Cardiff, United Kingdom
- Athanasios Hassoulas: Centre for Medical Education, School of Medicine, Cardiff University, Cardiff, United Kingdom
- Antonio F Pardiñas: Centre for Neuropsychiatric Genetics and Genomics, School of Medicine, Cardiff University, Cardiff, United Kingdom
- Luis Espinosa-Anke: School of Computer Science and Informatics, Cardiff University, Cardiff, United Kingdom
- Jose Camacho Collados: School of Computer Science and Informatics, Cardiff University, Cardiff, United Kingdom
48
Abstract
We study GPT-3, a recent large language model, using tools from cognitive psychology. More specifically, we assess GPT-3's decision-making, information search, deliberation, and causal reasoning abilities on a battery of canonical experiments from the literature. We find that much of GPT-3's behavior is impressive: It solves vignette-based tasks as well as or better than human subjects, is able to make decent decisions from descriptions, outperforms humans in a multiarmed bandit task, and shows signatures of model-based reinforcement learning. Yet we also find that small perturbations to vignette-based tasks can lead GPT-3 vastly astray, that it shows no signatures of directed exploration, and that it fails miserably in a causal reasoning task. Taken together, these results enrich our understanding of current large language models and pave the way for future investigations using tools from cognitive psychology to study increasingly capable and opaque artificial agents.
49
Salicchi L, Chersoni E, Lenci A. A study on surprisal and semantic relatedness for eye-tracking data prediction. Front Psychol 2023; 14:1112365. PMID: 36818086; PMCID: PMC9931754; DOI: 10.3389/fpsyg.2023.1112365.
Abstract
Previous research in computational linguistics has devoted considerable effort to using language models and/or distributional semantic models to predict metrics extracted from eye-tracking data. However, it is not clear whether the two components make distinct contributions, with recent studies claiming that surprisal scores estimated with large-scale, deep learning-based language models subsume the semantic relatedness component. In our study, we propose a regression experiment for estimating different eye-tracking metrics on two English corpora, contrasting the quality of the predictions with and without the surprisal and relatedness components. Different types of relatedness scores derived from both static and contextual models were also tested. Our results suggest that both components play a role in the prediction, with semantic relatedness surprisingly also contributing to the prediction of function words. Moreover, when the relatedness metric is computed with the contextual embeddings of the BERT model, it explains a larger amount of variance.
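The two predictors this study contrasts can be illustrated with toy stand-ins: surprisal as the negative log probability of a word given its context, and relatedness as the cosine between a word vector and the centroid of its context vectors. The bigram table and 2-dimensional vectors below are invented for demonstration; the study used large language models and static/contextual embeddings.

```python
import math

# Toy bigram probabilities and word vectors (illustrative only).
BIGRAM_P = {("the", "cat"): 0.2, ("cat", "sat"): 0.5}
VECTORS = {"the": [0.1, 0.9], "cat": [0.8, 0.3], "sat": [0.7, 0.4]}

def surprisal(prev, word):
    """Surprisal in bits: -log2 P(word | prev)."""
    return -math.log2(BIGRAM_P[(prev, word)])

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(a * a for a in v))
    return dot / (norm_u * norm_v)

def relatedness(word, context):
    """Cosine between a word's vector and the centroid of its context."""
    dim = len(VECTORS[word])
    centroid = [sum(VECTORS[w][i] for w in context) / len(context)
                for i in range(dim)]
    return cosine(VECTORS[word], centroid)

print(surprisal("cat", "sat"))            # 1.0 bit, since P = 0.5
print(relatedness("sat", ["the", "cat"]))
```

In the regression setup described above, per-token values like these would serve as predictors of eye-tracking metrics such as first-fixation or total reading time.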
Affiliation(s)
- Lavinia Salicchi: Department of Chinese and Bilingual Studies, The Hong Kong Polytechnic University, Kowloon, Hong Kong SAR, China
- Emmanuele Chersoni: Department of Chinese and Bilingual Studies, The Hong Kong Polytechnic University, Kowloon, Hong Kong SAR, China
- Alessandro Lenci: Computational Linguistics Laboratory (CoLing Lab), University of Pisa, Pisa, Italy
50
Luo M, Li S, Pang Y, Yao L, Ma R, Huang HY, Huang HD, Lee TY. Extraction of microRNA-target interaction sentences from biomedical literature by deep learning approach. Brief Bioinform 2023; 24:bbac497. PMID: 36440972; DOI: 10.1093/bib/bbac497.
Abstract
MicroRNA (miRNA)-target interactions (MTIs) play a substantial role in various cell activities, molecular regulation and physiological processes. Published biomedical literature is the carrier of high-confidence MTI knowledge; however, extracting this knowledge efficiently from large-scale published articles remains challenging. To address this issue, we constructed a deep learning-based model. We applied pre-trained language models to biomedical text to obtain representations and subsequently fed them into a deep neural network with gate mechanism layers and a fully connected layer to extract MTI information sentences. The performance of the proposed models was evaluated using two datasets constructed from text data obtained from miRTarBase. The validation and test results revealed that incorporating both PubMedBERT and SciBERT for sentence-level encoding with a long short-term memory (LSTM)-based deep neural network yields outstanding performance, with both F1 and accuracy above 80% on the validation and test data. Additionally, the proposed deep learning method outperformed the following machine learning methods: random forest, support vector machine, logistic regression and bidirectional LSTM. This work should greatly facilitate studies on MTI analysis and regulation. It is anticipated that it can assist in large-scale screening of miRNAs, revealing their functional roles in various diseases, which is important for the development of highly specific drugs with fewer side effects. Source code and corpus are publicly available at https://github.com/qi29.
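The gate mechanism mentioned in this abstract is not specified in detail here, but gating layers of this kind typically blend two encoder outputs element-wise through a learned sigmoid gate. A generic sketch under that assumption, with toy 3-dimensional vectors standing in for PubMedBERT and SciBERT sentence encodings (the names and weights below are illustrative, not the paper's parameters):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(vec_a, vec_b, gate_weights, gate_bias):
    """Element-wise gate g blends two sentence encodings:
    fused_i = g_i * a_i + (1 - g_i) * b_i, with
    g_i = sigmoid(w_i * (a_i + b_i) + bias)."""
    fused = []
    for a, b, w in zip(vec_a, vec_b, gate_weights):
        g = sigmoid(w * (a + b) + gate_bias)
        fused.append(g * a + (1.0 - g) * b)
    return fused

# Toy stand-ins for the two encoders' sentence vectors.
pubmedbert_vec = [0.2, -0.5, 0.9]
scibert_vec = [0.4, 0.1, -0.3]
fused = gated_fusion(pubmedbert_vec, scibert_vec, [1.0, 1.0, 1.0], 0.0)
print(fused)
```

Because each gate value lies in (0, 1), every fused component is a convex combination of the two encodings, letting the network learn per-dimension how much to trust each encoder before the LSTM and fully connected layers.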
Affiliation(s)
- Mengqi Luo: Warshel Institute for Computational Biology, The Chinese University of Hong Kong, Shenzhen, China; School of Life Sciences, University of Science and Technology of China, Hefei, China
- Shangfu Li: Warshel Institute for Computational Biology, The Chinese University of Hong Kong, Shenzhen, China
- Yuxuan Pang: Warshel Institute for Computational Biology and School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen, China
- Lantian Yao: Warshel Institute for Computational Biology and School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen, China
- Renfei Ma: Warshel Institute for Computational Biology, The Chinese University of Hong Kong, Shenzhen, China; School of Life Sciences, University of Science and Technology of China, Hefei, China
- Hsi-Yuan Huang: School of Medicine and Warshel Institute for Computational Biology, The Chinese University of Hong Kong, Shenzhen, China
- Hsien-Da Huang: School of Medicine and Warshel Institute for Computational Biology, The Chinese University of Hong Kong, Shenzhen, China
- Tzong-Yi Lee: Warshel Institute for Computational Biology, The Chinese University of Hong Kong, Shenzhen, China