1. Yoon S, Jang J, Son G, Park S, Hwang J, Choeh JY, Choi KH. Predicting neuroticism with open-ended response using natural language processing. Front Psychiatry 2024; 15:1437569. PMID: 39149156; PMCID: PMC11324482; DOI: 10.3389/fpsyt.2024.1437569.
Abstract
Introduction: With rapid advancements in natural language processing (NLP), predicting personality with this technology has become a significant research interest. In personality prediction, identifying appropriate questions to elicit natural language is particularly important because the questions determine the context of the responses. This study aimed to predict levels of neuroticism, a core psychological trait known to predict various psychological outcomes, using responses to a series of open-ended questions developed on the basis of the five-factor model of personality. The study examined the model's accuracy and explored the influence of item content on neuroticism prediction.
Methods: A total of 425 Korean adults were recruited and responded to 18 open-ended questions about their personalities, alongside a measure of the Five-Factor Model traits; in total, 30,576 Korean sentences were collected. Prediction models were developed with the pre-trained language model KoBERT, and accuracy, F1 score, precision, and recall were used as evaluation metrics.
Results: Items asking about social comparison, unintended harm, and negative feelings predicted neuroticism better than other items. For predicting depressivity, items related to negative feelings, social comparison, and emotions showed superior performance; for dependency, items related to unintended harm, social dominance, and negative feelings were the most predictive.
Discussion: We identified items that outperformed others in neuroticism prediction. Prediction models built on open-ended questions that theoretically aligned with neuroticism exhibited superior predictive performance.
Affiliation(s)
- Seowon Yoon
  - School of Psychology, Korea University, Seoul, Republic of Korea
  - KU Mind Health Institute, Korea University, Seoul, Republic of Korea
- Jihee Jang
  - School of Psychology, Korea University, Seoul, Republic of Korea
- Gaeun Son
  - School of Psychology, Korea University, Seoul, Republic of Korea
- Soohyun Park
  - School of Psychology, Korea University, Seoul, Republic of Korea
- Jueun Hwang
  - School of Psychology, Korea University, Seoul, Republic of Korea
- Joon Yeon Choeh
  - Department of Software, Sejong University, Seoul, Republic of Korea
- Kee-Hong Choi
  - School of Psychology, Korea University, Seoul, Republic of Korea
  - KU Mind Health Institute, Korea University, Seoul, Republic of Korea
2. Sikström S, Valavičiūtė I, Kuusela I, Evors N. Question-based computational language approach outperforms rating scales in quantifying emotional states. Commun Psychol 2024; 2:45. PMID: 39242812; PMCID: PMC11332055; DOI: 10.1038/s44271-024-00097-2.
Abstract
Psychological constructs are commonly quantified with closed-ended rating scales. However, recent advancements in natural language processing (NLP) enable the quantification of open-ended language responses. Here we demonstrate that descriptive word responses analyzed using NLP show higher accuracy in categorizing emotional states compared to traditional rating scales. One group of participants (N = 297) generated narratives related to depression, anxiety, satisfaction, or harmony, summarized them with five descriptive words, and rated them using rating scales. Another group (N = 434) evaluated these narratives (with descriptive words and rating scales) from the author's perspective. The descriptive words were quantified using NLP, and machine learning was used to categorize the responses into the corresponding emotional states. The results showed a significantly higher number of accurate categorizations of the narratives based on descriptive words (64%) than on rating scales (44%), questioning the notion that rating scales are more precise in measuring emotional states than language-based measures.
Affiliation(s)
- Sverker Sikström
  - Department of Psychology, Lund University, Lund, SE-221 00, Sweden
- Ieva Valavičiūtė
  - Department of Psychology, Lund University, Lund, SE-221 00, Sweden
- Inari Kuusela
  - Department of Psychology, Lund University, Lund, SE-221 00, Sweden
- Nicole Evors
  - Department of Psychology, Lund University, Lund, SE-221 00, Sweden
3. Meier T, Mehl MR, Martin M, Horn AB. When I am sixty-four… evaluating language markers of well-being in healthy aging narratives. PLoS One 2024; 19:e0302103. PMID: 38656961; PMCID: PMC11042717; DOI: 10.1371/journal.pone.0302103.
Abstract
Natural language use is a promising candidate for the development of innovative measures of well-being to complement self-report measures. The type of words individuals use can reveal important psychological processes that underlie well-being across the lifespan. In this preregistered, cross-sectional study, we propose a conceptual model of language markers of well-being and use written narratives about healthy aging (N = 701) and computerized text analysis (LIWC) to empirically validate the model. As hypothesized, we identified a model with three groups of language markers (reflecting affective, evaluative, and social processes). Initial validation with established self-report scales (N = 30 subscales) showed that these language markers reliably predict core components of well-being and underlying processes. Our results support the concurrent validity of the conceptual language model and allude to the added benefits of language-based measures, which are thought to reflect less conscious processes of well-being. Future research is needed to continue validating language markers of well-being across the lifespan in a theoretically informed and contextualized way, which will lay the foundation for inferring people's well-being from their natural language use.
Affiliation(s)
- Tabea Meier
  - Department of Psychology, University of Zurich, Zurich, Switzerland
  - University Research Priority Program (URPP) "Dynamics of Healthy Aging", University of Zurich, Zurich, Switzerland
  - Healthy Longevity Center, University of Zurich, Zurich, Switzerland
  - School of Education and Social Policy, Northwestern University, Evanston, Illinois, United States of America
- Matthias R. Mehl
  - Department of Psychology, University of Arizona, Tucson, Arizona, United States of America
- Mike Martin
  - Department of Psychology, University of Zurich, Zurich, Switzerland
  - University Research Priority Program (URPP) "Dynamics of Healthy Aging", University of Zurich, Zurich, Switzerland
  - Healthy Longevity Center, University of Zurich, Zurich, Switzerland
  - Center for Gerontology, University of Zurich, Zurich, Switzerland
  - Faculty of Health and Behavioral Sciences, School of Psychology, The University of Queensland, Brisbane, Qld, Australia
- Andrea B. Horn
  - Department of Psychology, University of Zurich, Zurich, Switzerland
  - University Research Priority Program (URPP) "Dynamics of Healthy Aging", University of Zurich, Zurich, Switzerland
  - Healthy Longevity Center, University of Zurich, Zurich, Switzerland
  - Center for Gerontology, University of Zurich, Zurich, Switzerland
4. Hitsuwari J, Okano H, Nomura M. Predicting attitudes toward ambiguity using natural language processing on free descriptions for open-ended question measurements. Sci Rep 2024; 14:8276. PMID: 38594447; PMCID: PMC11004121; DOI: 10.1038/s41598-024-59118-z.
Abstract
Individuals differ in their reactions to ambiguity, a difference conceptualized as attitudes toward ambiguity or ambiguity tolerance. Advances in natural language processing have made it possible to measure mental states and reactions through open-ended questions rather than the predefined numerical rating scales that have traditionally dominated psychological research. This study presented three ambiguity-related situations to 591 participants online and collected their responses in an open-ended format. After analysis with Bidirectional Encoder Representations from Transformers (BERT), the resulting scores were correlated with numerical ratings from a conventional questionnaire, yielding a significant, moderate positive correlation. This study therefore found that attitudes toward ambiguity can be measured with an open-ended method in which respondents report on everyday situations. This novel methodology can be extended to other scales in psychology and could be used in educational and clinical settings where participants can respond with minimal burden.
Affiliation(s)
- Jimpei Hitsuwari
  - Graduate School of Education, Kyoto University, Kyoto, Japan
  - Japan Society for the Promotion of Science, Tokyo, Japan
- Hirohito Okano
  - Graduate School of Education, Kyoto University, Kyoto, Japan
- Michio Nomura
  - Graduate School of Education, Kyoto University, Kyoto, Japan
5. Nilsson AH, Schwartz HA, Rosenthal RN, McKay JR, Vu H, Cho YM, Mahwish S, Ganesan AV, Ungar L. Language-based EMA assessments help understand problematic alcohol consumption. PLoS One 2024; 19:e0298300. PMID: 38446796; PMCID: PMC10917301; DOI: 10.1371/journal.pone.0298300.
Abstract
BACKGROUND: Unhealthy alcohol consumption is a severe public health problem, yet low to moderate alcohol consumption is associated with high subjective well-being, possibly because alcohol is commonly consumed socially, together with friends, who are often important for subjective well-being. Disentangling the health and social complexities of alcohol behavior has been difficult with traditional rating scales and cross-sectional designs. We aim to better understand these complexities by examining individuals' everyday affective subjective well-being language, in addition to rating scales, in both between- and within-person designs across multiple weeks.
METHOD: We collected daily language and ecological momentary assessments from 908 US restaurant workers (12,692 days) over two-week intervals. Participants were asked up to three times a day to "describe your current feelings", rate their emotions, and report their alcohol behavior in the past 24 hours, including whether they were drinking alone or with others.
RESULTS: Both between and within individuals, language-based subjective well-being predicted alcohol behavior more accurately than the corresponding rating scales. Individuals self-reported being happier on days when drinking more, and the language characteristic of those days predominantly described socializing with friends. Between individuals (over several weeks), subjective well-being correlated much more negatively with drinking alone (r = -.29) than with total drinking (r = -.10). Consistent with this, people who drank alone more generally described their feelings as sad, stressed, and anxious, and days of drinking alone were related to nervous and annoyed language as well as lower reported subjective well-being.
CONCLUSIONS: Individuals' daily subjective well-being, as measured via language, partly explained the social aspects of alcohol drinking. Being alone further explained this relationship: drinking alone was associated with lower subjective well-being.
Affiliation(s)
- August Håkan Nilsson
  - Department of Computer and Information Science, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America
  - Oslo Business School, Oslo Metropolitan University, Oslo, Norway
- Hansen Andrew Schwartz
  - Department of Computer Science, Stony Brook University, Stony Brook, New York, United States of America
- Richard N. Rosenthal
  - Department of Psychiatry, Renaissance School of Medicine at Stony Brook University, Stony Brook, NY, United States of America
- James R. McKay
  - Department of Psychiatry, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America
- Huy Vu
  - Department of Computer Science, Stony Brook University, Stony Brook, New York, United States of America
- Young-Min Cho
  - Department of Computer and Information Science, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America
- Syeda Mahwish
  - Department of Computer Science, Stony Brook University, Stony Brook, New York, United States of America
- Adithya V. Ganesan
  - Department of Computer Science, Stony Brook University, Stony Brook, New York, United States of America
- Lyle Ungar
  - Department of Computer and Information Science, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America
6. Kjell ONE, Kjell K, Schwartz HA. Beyond rating scales: With targeted evaluation, large language models are poised for psychological assessment. Psychiatry Res 2024; 333:115667. PMID: 38290286; DOI: 10.1016/j.psychres.2023.115667.
Abstract
In this narrative review, we survey recent empirical evaluations of AI-based language assessments and make the case that the technology of large language models is poised to change standardized psychological assessment. Artificial intelligence has been undergoing a purported "paradigm shift" initiated by new machine learning models: large language models (e.g., BERT, LLaMA, and the model behind ChatGPT). These models have achieved unprecedented accuracy on most computerized language processing tasks, from web search to automatic machine translation and question answering, while their dialogue-based forms, like ChatGPT, have captured the interest of over a million users. The success of large language models is mostly attributed to their capability to numerically represent words in context, long a weakness of previous attempts to automate psychological assessment from language. While potential applications for automated therapy are beginning to be studied on the heels of ChatGPT's success, here we present evidence suggesting that, with thorough validation of targeted deployment scenarios, AI's newest technology can move mental health assessment away from rating scales and toward how people naturally communicate: in language.
Affiliation(s)
- Oscar N E Kjell
  - Psychology Department, Lund University, Sweden
  - Computer Science Department, Stony Brook University, United States
- H Andrew Schwartz
  - Psychology Department, Lund University, Sweden
  - Computer Science Department, Stony Brook University, United States
7. Nilsson AH, Eichstaedt JC, Lomas T, Schwartz A, Kjell O. The Cantril Ladder elicits thoughts about power and wealth. Sci Rep 2024; 14:2642. PMID: 38302578; PMCID: PMC10834405; DOI: 10.1038/s41598-024-52939-y.
Abstract
The Cantril Ladder is among the most widely administered subjective well-being measures; every year it is collected in 140+ countries in the Gallup World Poll and reported in the World Happiness Report. The measure asks respondents to evaluate their lives on a ladder from worst (bottom) to best (top). Prior work found Cantril Ladder scores to be sensitive to social comparison and to reflect one's relative position in the income distribution. To understand this, we explored how respondents interpret the Cantril Ladder. We analyzed word responses from 1581 UK adults and tested the impact of (a) the ladder imagery, (b) the scale anchors of worst to best possible life, and (c) the anchors of bottom to top. Using three language analysis techniques (dictionary, topic, and word embeddings), we found that the Cantril Ladder framing emphasizes power and wealth over broader well-being and relationship concepts compared with the other study conditions. Further, altering the framings increased preferred scale levels from 8.4 to 8.9 (Cohen's d = 0.36). Introducing harmony as an anchor yielded the strongest divergence from the Cantril Ladder, reducing mentions of power and wealth topics the most (Cohen's d = -0.76). Our findings refine the understanding of historical Cantril Ladder data and may help guide the future evolution of well-being metrics and guidelines.
Affiliation(s)
- August Håkan Nilsson
  - Department of Psychology, Lund University, Lund, Sweden
  - Oslo Business School, Oslo Metropolitan University, Oslo, Norway
- Johannes C Eichstaedt
  - Department of Psychology, Institute for Human-Centered A.I., Stanford University, Stanford, CA, USA
- Tim Lomas
  - Department of Epidemiology, Harvard University, Cambridge, USA
- Andrew Schwartz
  - Department of Computer Science, Stony Brook University, Stony Brook, NY, USA
- Oscar Kjell
  - Department of Psychology, Lund University, Lund, Sweden
8. Aroyehun ST, Malik L, Metzler H, Haimerl N, Di Natale A, Garcia D. LEIA: Linguistic Embeddings for the Identification of Affect. EPJ Data Sci 2023; 12:52. PMID: 38020476; PMCID: PMC10654159; DOI: 10.1140/epjds/s13688-023-00427-0.
Abstract
The wealth of text data generated by social media has enabled new kinds of analysis of emotions with language models. These models are often trained on small and costly datasets of text annotations produced by readers who guess the emotions expressed by others in social media posts. This affects the quality of emotion identification methods due to training data size limitations and noise in the production of labels used in model development. We present LEIA, a model for emotion identification in text that has been trained on a dataset of more than 6 million posts with self-annotated emotion labels for happiness, affection, sadness, anger, and fear. LEIA is based on a word masking method that enhances the learning of emotion words during model pre-training. LEIA achieves macro-F1 values of approximately 73 on three in-domain test datasets, outperforming other supervised and unsupervised methods in a strong benchmark that shows that LEIA generalizes across posts, users, and time periods. We further perform an out-of-domain evaluation on five different datasets of social media and other sources, showing LEIA's robust performance across media, data collection methods, and annotation schemes. Our results show that LEIA generalizes its classification of anger, happiness, and sadness beyond the domain it was trained on. LEIA can be applied in future research to provide better identification of emotions in text from the perspective of the writer.
Affiliation(s)
- Segun Taofeek Aroyehun
  - Department of Politics and Public Administration, University of Konstanz, Konstanz, Germany
  - Graz University of Technology, Graz, Austria
- Lukas Malik
  - Complexity Science Hub, Vienna, Austria
  - Université Paris Saclay, Paris, France
- Hannah Metzler
  - Graz University of Technology, Graz, Austria
  - Complexity Science Hub, Vienna, Austria
  - Medical University of Vienna, Vienna, Austria
- Anna Di Natale
  - Graz University of Technology, Graz, Austria
  - Complexity Science Hub, Vienna, Austria
  - Medical University of Vienna, Vienna, Austria
- David Garcia
  - Department of Politics and Public Administration, University of Konstanz, Konstanz, Germany
  - Graz University of Technology, Graz, Austria
  - Complexity Science Hub, Vienna, Austria
  - Medical University of Vienna, Vienna, Austria
9. Simchon A, Sutton A, Edwards M, Lewandowsky S. Online reading habits can reveal personality traits: towards detecting psychological microtargeting. PNAS Nexus 2023; 2:pgad191. PMID: 37333766; PMCID: PMC10276193; DOI: 10.1093/pnasnexus/pgad191.
Abstract
Building on big data from Reddit, we generated two computational text models: (i) predicting the personality of users from the text they have written and (ii) predicting the personality of users from the text they have consumed. The second model is novel and without precedent in the literature. We recruited active Reddit users (N = 1,105) from fiction-writing communities. The participants completed a Big Five personality questionnaire and consented for their Reddit activity to be scraped and used to create a machine learning model. We trained a natural language processing model (Bidirectional Encoder Representations from Transformers; BERT) to predict personality from produced text (average performance: r = 0.33). We then applied this model to a new set of Reddit users (N = 10,050), predicted their personality based on their produced text, and trained a second BERT model to predict their predicted-personality scores from consumed text (average performance: r = 0.13). By doing so, we provide a first glimpse into the linguistic markers of personality-congruent consumed content.
Affiliation(s)
- Almog Simchon
  - School of Psychological Science, University of Bristol, Bristol BS8 1QU, UK
- Adam Sutton
  - Department of Computer Science, University of Bristol, Bristol BS8 1QU, UK
- Matthew Edwards
  - Department of Computer Science, University of Bristol, Bristol BS8 1QU, UK
- Stephan Lewandowsky
  - School of Psychological Science, University of Bristol, Bristol BS8 1QU, UK
  - School of Psychological Science, The University of Western Australia, Perth 6009, Australia
  - Department of Psychology, The University of Potsdam, Potsdam, Germany
10. Chen ZS, Kulkarni P, Galatzer-Levy IR, Bigio B, Nasca C, Zhang Y. Modern views of machine learning for precision psychiatry. Patterns (N Y) 2022; 3:100602. PMID: 36419447; PMCID: PMC9676543; DOI: 10.1016/j.patter.2022.100602.
Abstract
In light of the National Institute of Mental Health (NIMH) Research Domain Criteria (RDoC) and the advent of functional neuroimaging, novel technologies and methods provide new opportunities to develop precise and personalized prognosis and diagnosis of mental disorders. Machine learning (ML) and artificial intelligence (AI) technologies are playing an increasingly critical role in the new era of precision psychiatry. Combining ML/AI with neuromodulation technologies can potentially provide explainable solutions in clinical practice and effective therapeutic treatment. Advanced wearable and mobile technologies also call for a new role for ML/AI in digital phenotyping for mobile mental health. In this review, we provide a comprehensive overview of ML methodologies and applications combining neuroimaging, neuromodulation, and advanced mobile technologies in psychiatric practice. We further review the role of ML in molecular phenotyping and cross-species biomarker identification for precision psychiatry. We also discuss explainable AI (XAI) and neuromodulation in a closed, human-in-the-loop manner, and highlight the potential of ML in multimedia information extraction and multimodal data fusion. Finally, we discuss conceptual and practical challenges in precision psychiatry and highlight ML opportunities for future research.
Affiliation(s)
- Zhe Sage Chen
  - Department of Psychiatry, New York University Grossman School of Medicine, New York, NY 10016, USA
  - Department of Neuroscience and Physiology, New York University Grossman School of Medicine, New York, NY 10016, USA
  - The Neuroscience Institute, New York University Grossman School of Medicine, New York, NY 10016, USA
  - Department of Biomedical Engineering, New York University Tandon School of Engineering, Brooklyn, NY 11201, USA
- Isaac R. Galatzer-Levy
  - Department of Psychiatry, New York University Grossman School of Medicine, New York, NY 10016, USA
  - Meta Reality Lab, New York, NY, USA
- Benedetta Bigio
  - Department of Psychiatry, New York University Grossman School of Medicine, New York, NY 10016, USA
- Carla Nasca
  - Department of Psychiatry, New York University Grossman School of Medicine, New York, NY 10016, USA
  - The Neuroscience Institute, New York University Grossman School of Medicine, New York, NY 10016, USA
- Yu Zhang
  - Department of Bioengineering, Lehigh University, Bethlehem, PA 18015, USA
  - Department of Electrical and Computer Engineering, Lehigh University, Bethlehem, PA 18015, USA
11. van Loon A. Three families of automated text analysis. Soc Sci Res 2022; 108:102798. PMID: 36334926; DOI: 10.1016/j.ssresearch.2022.102798.
Abstract
Since the beginning of this millennium, data in the form of human-generated text in a machine-readable format has become increasingly available to social scientists, presenting a unique window into social life. However, harnessing vast quantities of this highly unstructured data in a systematic way presents a unique combination of analytical and methodological challenges. Luckily, our understanding of how to overcome these challenges has also developed greatly over this same period. In this article, I present a novel typology of the methods social scientists have used to analyze text data at scale in the interest of testing and developing social theory. I describe three "families" of methods: analyses of (1) term frequency, (2) document structure, and (3) semantic similarity. For each family of methods, I discuss their logical and statistical foundations, analytical strengths and weaknesses, as well as prominent variants and applications.