1
Kosinski M. Evaluating large language models in theory of mind tasks. Proc Natl Acad Sci U S A 2024; 121:e2405460121. PMID: 39471222; PMCID: PMC11551352; DOI: 10.1073/pnas.2405460121.
Abstract
Eleven large language models (LLMs) were assessed using 40 bespoke false-belief tasks, considered a gold standard in testing theory of mind (ToM) in humans. Each task included a false-belief scenario, three closely matched true-belief control scenarios, and the reversed versions of all four. An LLM had to solve all eight scenarios to solve a single task. Older models solved no tasks; Generative Pre-trained Transformer (GPT)-3-davinci-003 (from November 2022) and ChatGPT-3.5-turbo (from March 2023) solved 20% of the tasks; ChatGPT-4 (from June 2023) solved 75% of the tasks, matching the performance of 6-y-old children observed in past studies. We explore the potential interpretation of these results, including the intriguing possibility that ToM-like ability, previously considered unique to humans, may have emerged as an unintended by-product of LLMs' improving language skills. Regardless of how we interpret these outcomes, they signify the advent of more powerful and socially skilled AI, with profound positive and negative implications.
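To make the all-or-nothing scoring protocol concrete, the following Python sketch scores one task from its eight scenarios as described above: a task counts as solved only if the false-belief scenario, the three true-belief controls, and all four reversed versions are answered correctly. The scenario keys, dictionaries, and function names are illustrative assumptions, not the study's actual data format.

# Illustrative sketch of the all-or-nothing task scoring; data structures are hypothetical.

SCENARIO_KEYS = [
    "false_belief", "true_belief_1", "true_belief_2", "true_belief_3",
    "false_belief_reversed", "true_belief_1_reversed",
    "true_belief_2_reversed", "true_belief_3_reversed",
]

def task_solved(model_answers: dict[str, str], correct_answers: dict[str, str]) -> bool:
    # A task is solved only if all eight scenarios are answered correctly.
    return all(model_answers[key] == correct_answers[key] for key in SCENARIO_KEYS)

def solve_rate(tasks: list[tuple[dict, dict]]) -> float:
    # Fraction of the 40-task battery solved; each item pairs model answers with the answer key.
    return sum(task_solved(model, key) for model, key in tasks) / len(tasks)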
Affiliation(s)
- Michal Kosinski
- Graduate School of Business, Stanford University, Stanford, CA 94305
2
Cao X, Kosinski M. Large language models and humans converge in judging public figures' personalities. PNAS Nexus 2024; 3:pgae418. PMID: 39359393; PMCID: PMC11443023; DOI: 10.1093/pnasnexus/pgae418.
Abstract
ChatGPT-4 and 600 human raters evaluated 226 public figures' personalities using the Ten-Item Personality Inventory. The correlation between ChatGPT-4 and aggregate human ratings ranged from r = 0.76 to 0.87, outperforming the models specifically trained to make such predictions. Notably, the model was not provided with any training data or feedback on its performance. We discuss the potential explanations and practical implications of ChatGPT-4's ability to mimic human responses accurately.
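As a rough illustration of the reported accuracy metric, the sketch below averages the human raters' Ten-Item Personality Inventory scores per public figure and trait and correlates the aggregates with the model's ratings; the array shapes and names are assumptions for illustration, not the authors' pipeline.

import numpy as np

def trait_correlations(human_ratings: np.ndarray, gpt_ratings: np.ndarray) -> np.ndarray:
    # human_ratings: (n_raters, n_figures, n_traits); gpt_ratings: (n_figures, n_traits).
    # Returns the Pearson r between aggregate human ratings and GPT ratings for each trait.
    aggregate = human_ratings.mean(axis=0)
    return np.array([
        np.corrcoef(aggregate[:, trait], gpt_ratings[:, trait])[0, 1]
        for trait in range(aggregate.shape[1])
    ])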
Affiliation(s)
- Xubo Cao
- Graduate School of Business, Stanford University, Stanford, CA 94305, USA
- Michal Kosinski
- Graduate School of Business, Stanford University, Stanford, CA 94305, USA
3
Dennstädt F, Zink J, Putora PM, Hastings J, Cihoric N. Title and abstract screening for literature reviews using large language models: an exploratory study in the biomedical domain. Syst Rev 2024; 13:158. PMID: 38879534; PMCID: PMC11180407; DOI: 10.1186/s13643-024-02575-4.
Abstract
BACKGROUND: Systematically screening published literature to determine which publications to include in a review is a time-consuming and difficult task. Large language models (LLMs) are an emerging technology with promising capabilities for automating language-related tasks that may be useful for this purpose.
METHODS: LLMs were used as part of an automated system to evaluate the relevance of publications to a given topic, based on defined criteria and on the title and abstract of each publication. A Python script was created to generate structured prompts consisting of text strings for the instruction, title, abstract, and relevant criteria, which were provided to an LLM. The LLM rated the relevance of each publication on a Likert scale (low relevance to high relevance). By specifying a threshold, different classifiers for inclusion/exclusion of publications could then be defined. The approach was applied with four different openly available LLMs to ten published data sets of biomedical literature reviews and to a new, human-created data set for a hypothetical systematic literature review.
RESULTS: The performance of the classifiers varied depending on the LLM used and the data set analyzed. Regarding sensitivity/specificity, the classifiers yielded 94.48%/31.78% for the FlanT5 model, 97.58%/19.12% for the OpenHermes-NeuralChat model, 81.93%/75.19% for the Mixtral model, and 97.58%/38.34% for the Platypus 2 model on the ten published data sets. The same classifiers yielded 100% sensitivity at specificities of 12.58%, 4.54%, 62.47%, and 24.74% on the newly created data set. Changing the standard settings of the approach (minor adaptation of the instruction prompt and/or changing the range of the Likert scale from 1-5 to 1-10) had a considerable impact on performance.
CONCLUSIONS: LLMs can be used to evaluate the relevance of scientific publications to a given review topic, and classifiers based on this approach show promising results. To date, little is known about how well such systems would perform if used prospectively in systematic literature reviews and what further implications this might have. However, it is likely that researchers will increasingly use LLMs to evaluate and classify scientific publications.
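The core of the described system can be sketched in a few lines of Python: build a structured prompt from the instruction, title, abstract, and criteria; ask an LLM for a relevance rating on a 1-5 Likert scale; and include the publication if the rating meets a threshold. The prompt wording, the query_llm placeholder, and the digit-parsing rule are illustrative assumptions rather than the authors' exact implementation.

def build_prompt(instruction: str, title: str, abstract: str, criteria: str) -> str:
    # Structured prompt combining the instruction, inclusion criteria, and the record to screen.
    return (
        f"{instruction}\n\n"
        f"Criteria:\n{criteria}\n\n"
        f"Title: {title}\n"
        f"Abstract: {abstract}\n\n"
        "Rate the relevance of this publication on a scale from 1 (low) to 5 (high). "
        "Answer with a single number."
    )

def include_publication(instruction: str, title: str, abstract: str, criteria: str,
                        query_llm, threshold: int = 3) -> bool:
    # query_llm is a placeholder for a call to any openly available LLM.
    reply = query_llm(build_prompt(instruction, title, abstract, criteria))
    rating = int(reply.strip()[0])  # assumes the reply starts with the Likert digit
    return rating >= threshold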
Affiliation(s)
- Fabio Dennstädt
- Department of Radiation Oncology, Cantonal Hospital of St. Gallen, St. Gallen, Switzerland
- Department of Radiation Oncology, Inselspital, Bern University Hospital and University of Bern, Bern, Switzerland
- Johannes Zink
- Institute for Computer Science, University of Würzburg, Würzburg, Germany
- Paul Martin Putora
- Department of Radiation Oncology, Cantonal Hospital of St. Gallen, St. Gallen, Switzerland
- Department of Radiation Oncology, Inselspital, Bern University Hospital and University of Bern, Bern, Switzerland
- Janna Hastings
- Institute for Implementation Science in Health Care, University of Zurich, Zurich, Switzerland
- School of Medicine, University of St. Gallen, St. Gallen, Switzerland
- Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Nikola Cihoric
- Department of Radiation Oncology, Inselspital, Bern University Hospital and University of Bern, Bern, Switzerland
4
Peters H, Matz SC. Large language models can infer psychological dispositions of social media users. PNAS Nexus 2024; 3:pgae231. PMID: 38948324; PMCID: PMC11211928; DOI: 10.1093/pnasnexus/pgae231.
Abstract
Large language models (LLMs) demonstrate increasingly human-like abilities across a wide variety of tasks. In this paper, we investigate whether LLMs like ChatGPT can accurately infer the psychological dispositions of social media users and whether their ability to do so varies across socio-demographic groups. Specifically, we test whether GPT-3.5 and GPT-4 can derive the Big Five personality traits from users' Facebook status updates in a zero-shot learning scenario. Our results show an average correlation of r = 0.29 (range = [0.22, 0.33]) between LLM-inferred and self-reported trait scores, a level of accuracy that is similar to that of supervised machine learning models specifically trained to infer personality. Our findings also highlight heterogeneity in the accuracy of personality inferences across different age groups and gender categories: predictions were found to be more accurate for women and younger individuals on several traits, suggesting a potential bias stemming from the underlying training data or differences in online self-expression. The ability of LLMs to infer psychological dispositions from user-generated text has the potential to democratize access to cheap and scalable psychometric assessments for both researchers and practitioners. On the one hand, this democratization might facilitate large-scale research of high ecological validity and spark innovation in personalized services. On the other hand, it also raises ethical concerns regarding user privacy and self-determination, highlighting the need for stringent ethical frameworks and regulation.
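A minimal zero-shot setup of the kind described here can be sketched as follows: prompt a chat model with a user's concatenated status updates, ask for Big Five scores in a structured format, and correlate the inferred scores with self-reports across users. The prompt wording, the chat placeholder, and the JSON response format are assumptions for illustration, not the authors' exact protocol.

import json
import numpy as np

TRAITS = ["openness", "conscientiousness", "extraversion", "agreeableness", "neuroticism"]

def infer_big_five(status_updates: list[str], chat) -> dict:
    # chat is a placeholder callable that sends the prompt to an LLM and returns its reply text.
    prompt = (
        "Based on the following Facebook status updates, rate the author on each Big Five "
        f"trait from 1 to 5. Reply as JSON with the keys {TRAITS}.\n\n"
        + "\n".join(status_updates)
    )
    return json.loads(chat(prompt))

def mean_trait_correlation(inferred: np.ndarray, self_reported: np.ndarray) -> float:
    # Both arrays have shape (n_users, 5); returns the mean Pearson r across the five traits.
    rs = [np.corrcoef(inferred[:, t], self_reported[:, t])[0, 1] for t in range(5)]
    return float(np.mean(rs))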
Affiliation(s)
- Heinrich Peters
- Columbia Business School, Columbia University, New York, NY 10027, USA
- Sandra C Matz
- Columbia Business School, Columbia University, New York, NY 10027, USA
5
Qu Y, Wei C, Du P, Che W, Zhang C, Ouyang W, Bian Y, Xu F, Hu B, Du K, Wu H, Liu J, Liu Q. Integration of cognitive tasks into artificial general intelligence test for large models. iScience 2024; 27:109550. PMID: 38595796; PMCID: PMC11001637; DOI: 10.1016/j.isci.2024.109550.
Abstract
During the evolution of large models, performance evaluation is necessary for assessing their capabilities. However, current model evaluations rely mainly on specific tasks and datasets and lack a unified framework for assessing the multidimensional intelligence of large models. In this perspective, we advocate a comprehensive framework of cognitive science-inspired artificial general intelligence (AGI) tests, encompassing crystallized, fluid, social, and embodied intelligence. The AGI tests consist of well-designed cognitive tests adopted from human intelligence tests, which are then encapsulated within an immersive virtual community. We propose increasing the complexity of AGI testing tasks commensurate with advancements in large models, and we emphasize the need to interpret test results carefully in order to avoid false negatives and false positives. We believe that cognitive science-inspired AGI tests will effectively guide the targeted improvement of large models in specific dimensions of intelligence and accelerate their integration into human society.
Affiliation(s)
- Youzhi Qu
- Department of Biomedical Engineering, Southern University of Science and Technology, Shenzhen 518055, China
- Chen Wei
- Department of Biomedical Engineering, Southern University of Science and Technology, Shenzhen 518055, China
- Penghui Du
- Department of Biomedical Engineering, Southern University of Science and Technology, Shenzhen 518055, China
- Wenxin Che
- Department of Biomedical Engineering, Southern University of Science and Technology, Shenzhen 518055, China
- Chi Zhang
- Department of Biomedical Engineering, Southern University of Science and Technology, Shenzhen 518055, China
- Feiyang Xu
- iFLYTEK AI Research, Hefei 230088, China
- Bin Hu
- School of Medical Technology, Beijing Institute of Technology, Beijing 100081, China
- Kai Du
- Institute for Artificial Intelligence, Peking University, Beijing 100871, China
- Haiyan Wu
- Centre for Cognitive and Brain Sciences and Department of Psychology, University of Macau, Macau 999078, China
- Jia Liu
- Department of Psychology, Tsinghua University, Beijing 100084, China
- Quanying Liu
- Department of Biomedical Engineering, Southern University of Science and Technology, Shenzhen 518055, China
6
Cao X, Kosinski M. Large language models know how the personality of public figures is perceived by the general public. Sci Rep 2024; 14:6735. PMID: 38509191; PMCID: PMC10954708; DOI: 10.1038/s41598-024-57271-z.
Abstract
We show that people's perceptions of public figures' personalities can be accurately predicted from their names' location in GPT-3's semantic space. We collected Big Five personality perceptions of 226 public figures from 600 human raters. Cross-validated linear regression was used to predict human perceptions from public figures' name embeddings extracted from GPT-3. The models' accuracy ranged from r = .78 to .88 without controls and from r = .53 to .70 when controlling for public figures' likability and demographics, after correcting for attenuation. Prediction models showed high face validity as revealed by the personality-descriptive adjectives occupying their extremes. Our findings reveal that GPT-3 word embeddings capture signals pertaining to individual differences and intimate traits.
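The prediction pipeline lends itself to a short sketch: regress the aggregate human perception of one trait on the public figures' name embeddings with cross-validation and report the correlation between out-of-fold predictions and observed perceptions. Ridge regularization stands in for the paper's cross-validated linear regression, the attenuation correction is omitted, and the array shapes and names are illustrative assumptions.

import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_predict

def embedding_prediction_accuracy(name_embeddings: np.ndarray, perceived_trait: np.ndarray) -> float:
    # name_embeddings: (n_figures, embedding_dim); perceived_trait: (n_figures,).
    # Returns the cross-validated Pearson r for one Big Five perception dimension.
    model = RidgeCV(alphas=np.logspace(-3, 3, 13))
    predictions = cross_val_predict(model, name_embeddings, perceived_trait, cv=10)
    return float(np.corrcoef(predictions, perceived_trait)[0, 1])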
Affiliation(s)
- Xubo Cao
- Stanford University, Stanford, USA.
7
Dennstädt F, Hastings J, Putora PM, Vu E, Fischer GF, Süveg K, Glatzer M, Riggenbach E, Hà HL, Cihoric N. Exploring Capabilities of Large Language Models such as ChatGPT in Radiation Oncology. Adv Radiat Oncol 2024; 9:101400. PMID: 38304112; PMCID: PMC10831180; DOI: 10.1016/j.adro.2023.101400.
Abstract
Purpose: Technological progress in machine learning and natural language processing has led to the development of large language models (LLMs) capable of producing well-formed text responses and providing natural language access to knowledge. Modern conversational LLMs such as ChatGPT have shown remarkable capabilities across a variety of fields, including medicine, and may even handle highly specialized medical knowledge within specific disciplines, such as radiation therapy. We conducted an exploratory study to examine the capabilities of ChatGPT to answer questions in radiation therapy.
Methods and Materials: A set of multiple-choice questions covering general clinical, physics, and biology knowledge in radiation oncology, as well as a set of open-ended questions, was created. These were given as prompts to the LLM ChatGPT, and the answers were collected and analyzed. For the multiple-choice questions, we checked how many of the model's answers could be clearly assigned to one of the allowed answer options and determined the proportion of correct answers. For the open-ended questions, independent blinded radiation oncologists evaluated the quality of the answers for correctness and usefulness on a 5-point Likert scale. The evaluators were also asked to provide suggestions for improving the quality of the answers.
Results: For the 70 multiple-choice questions, ChatGPT gave valid answers in 66 cases (94.3%). In 60.61% of the valid answers, the selected answer was correct (50.0% of clinical questions, 78.6% of physics questions, and 58.3% of biology questions). For the 25 open-ended questions, 12 of ChatGPT's answers were rated "acceptable," "good," or "very good" for both correctness and helpfulness by all 6 participating radiation oncologists. Overall, regarding correctness/helpfulness, the answers were considered "very good" in 29.3%/28%, "good" in 28%/29.3%, "acceptable" in 19.3%/19.3%, "bad" in 9.3%/9.3%, and "very bad" in 14%/14% of cases.
Conclusions: Modern conversational LLMs such as ChatGPT can provide satisfying answers to many relevant questions in radiation therapy. Because they still fall short of consistently providing correct information, relying on them to obtain medical information remains problematic. As LLMs improve, they are expected to have an increasing impact not only on society in general but also on clinical practice, including radiation oncology.
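The multiple-choice part of the evaluation reduces to two numbers, the share of replies that can be assigned to an allowed option and the share of those that are correct, as in the following sketch. The letter-extraction rule is a crude illustrative assumption, not the authors' exact procedure.

import re

def parse_choice(reply: str, allowed: str = "ABCDE"):
    # Return the single answer letter the reply commits to, or None if the reply cannot be
    # clearly assigned to exactly one of the allowed options.
    letters = set(re.findall(rf"\b([{allowed}])\b", reply.upper()))
    return letters.pop() if len(letters) == 1 else None

def evaluate(replies: list, correct: list) -> tuple:
    # Returns (validity, accuracy): the fraction of valid answers and, among those,
    # the fraction matching the answer key.
    parsed = [parse_choice(reply) for reply in replies]
    pairs = [(p, c) for p, c in zip(parsed, correct) if p is not None]
    validity = len(pairs) / len(replies)
    accuracy = sum(p == c for p, c in pairs) / len(pairs) if pairs else 0.0
    return validity, accuracy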
Affiliation(s)
- Fabio Dennstädt
- Department of Radiation Oncology, Kantonsspital St. Gallen, St. Gallen, Switzerland
- Janna Hastings
- School of Medicine, University of St. Gallen, St. Gallen, Switzerland
- Institute for Implementation Science in Health Care, University of Zurich, Zurich, Switzerland
- Paul Martin Putora
- Department of Radiation Oncology, Kantonsspital St. Gallen, St. Gallen, Switzerland
- Department of Radiation Oncology, Inselspital, Bern University Hospital and University of Bern, Bern, Switzerland
- Erwin Vu
- Department of Radiation Oncology, Kantonsspital St. Gallen, St. Gallen, Switzerland
- Galina F. Fischer
- Department of Radiation Oncology, Kantonsspital St. Gallen, St. Gallen, Switzerland
- Krisztian Süveg
- Department of Radiation Oncology, Kantonsspital St. Gallen, St. Gallen, Switzerland
- Markus Glatzer
- Department of Radiation Oncology, Kantonsspital St. Gallen, St. Gallen, Switzerland
- Elena Riggenbach
- Department of Radiation Oncology, Inselspital, Bern University Hospital and University of Bern, Bern, Switzerland
- Hông-Linh Hà
- Department of Radiation Oncology, Inselspital, Bern University Hospital and University of Bern, Bern, Switzerland
- Nikola Cihoric
- Department of Radiation Oncology, Inselspital, Bern University Hospital and University of Bern, Bern, Switzerland
8
Derbal Y. Adaptive Cancer Therapy in the Age of Generative Artificial Intelligence. Cancer Control 2024; 31:10732748241264704. PMID: 38897721; PMCID: PMC11189021; DOI: 10.1177/10732748241264704.
Abstract
Therapeutic resistance is a major challenge facing the design of effective cancer treatments. Adaptive cancer therapy is, in principle, the most viable approach to managing cancer's adaptive dynamics through drug combinations with dose timing and modulation. However, numerous open issues stand between adaptive therapy and clinical success. Chief among these is the feasibility of real-time prediction of treatment response, a bedrock requirement of adaptive therapy. Generative artificial intelligence has the potential to learn prediction models of treatment response from clinical, molecular, and radiomics data about patients and their treatments. The article explores this potential through a proposed model that integrates Generative Pre-trained Transformers (GPTs) in a closed loop with adaptive treatments to predict the trajectories of disease progression. The conceptual model and the challenges facing its realization are discussed in the broader context of artificial intelligence integration in oncology.
Affiliation(s)
- Youcef Derbal
- Ted Rogers School of Information Technology Management, Toronto Metropolitan University, Toronto, ON, Canada
9
Du M. Machine vs. human, who makes a better judgment on innovation? Take GPT-4 for example. Front Artif Intell 2023; 6:1206516. PMID: 37680588; PMCID: PMC10482032; DOI: 10.3389/frai.2023.1206516.
Abstract
Introduction: Human decision-making is a complex process that is often influenced by various external and internal factors. One such factor is noise: random and irrelevant influences that can skew outcomes.
Methods: This essay uses the CAT test and computer simulations to measure creativity.
Results: The evidence indicates that humans are intrinsically prone to noise, leading to inconsistent and, at times, inaccurate decisions. In contrast, simple rules demonstrate higher accuracy and consistency, while artificial intelligence demonstrates an even higher capability to process vast amounts of data and employ logical algorithms.
Discussion: AI, and its intuitive capabilities in particular, may be surpassing human intuition in specific decision-making scenarios. This raises crucial questions about the future roles of humans and machines in decision-making, especially in domains where precision is paramount.
Affiliation(s)
- Mark Du
- Department of Computer Science, National Taiwan University, New Taipei, Taiwan