1
Nussbaum C, Frühholz S, Schweinberger SR. Understanding voice naturalness. Trends Cogn Sci 2025; 29:467-480. [PMID: 40011186] [DOI: 10.1016/j.tics.2025.01.010]
Abstract
The perceived naturalness of a voice is a prominent property emerging from vocal sounds, which affects our interaction with both human and artificial agents. Despite its importance, a systematic understanding of voice naturalness is elusive. This is due to (i) conceptual underspecification, (ii) heterogeneous operationalization, (iii) lack of exchange between research on human and synthetic voices, and (iv) insufficient anchoring in voice perception theory. This review reflects on current insights into voice naturalness by pooling evidence from a wider interdisciplinary literature. Against that backdrop, it offers a concise definition of naturalness and proposes a conceptual framework rooted in both empirical findings and theoretical models. Finally, it identifies gaps in current understanding of voice naturalness and sketches perspectives for empirical progress.
Affiliation(s)
- Christine Nussbaum
- Department for General Psychology and Cognitive Neuroscience, Friedrich Schiller University Jena, 07743 Jena, Germany; Voice Research Unit, Friedrich Schiller University, 07743 Jena, Germany; The Voice Communication Sciences (VoCS) MSCA Doctoral Network.
- Sascha Frühholz
- The Voice Communication Sciences (VoCS) MSCA Doctoral Network; Department of Psychology, University of Oslo, 0371 Oslo, Norway; Cognitive and Affective Neuroscience Unit, University of Zurich, 8050 Zurich, Switzerland
- Stefan R Schweinberger
- Department for General Psychology and Cognitive Neuroscience, Friedrich Schiller University Jena, 07743 Jena, Germany; Voice Research Unit, Friedrich Schiller University, 07743 Jena, Germany; The Voice Communication Sciences (VoCS) MSCA Doctoral Network; Swiss Center for Affective Sciences, University of Geneva, 1222 Geneva, Switzerland; German Center for Mental Health (DZPG), Site Jena-Halle-Magdeburg, Germany
2
Lavan N, Ahmed A, Tyrene Oteng C, Aden M, Nasciemento-Krüger L, Raffiq Z, Mareschal I. Similarities in emotion perception from faces and voices: evidence from emotion sorting tasks. Cogn Emot 2025:1-17. [PMID: 40088052] [DOI: 10.1080/02699931.2025.2478478]
Abstract
Emotions are expressed via many features, including facial displays, vocal intonation, and touch, and perceivers can often interpret emotional displays across the different modalities with high accuracy. Here, we examine how emotion perception from faces and voices relates to one another, probing individual differences in emotion recognition abilities across visual and auditory modalities. We developed a novel emotion sorting task, in which participants were tasked with freely grouping different stimuli into perceived emotional categories, without requiring pre-defined emotion labels. Participants completed two emotion sorting tasks, one using silent videos of facial expressions, the other using audio recordings of vocal expressions. We furthermore manipulated emotional intensity, contrasting more subtle, lower-intensity versus higher-intensity emotion portrayals. We find that participants' performance on the emotion sorting task was similar for face and voice stimuli. As expected, performance was lower when stimuli were of low emotional intensity. Consistent with previous reports, we find that task performance was positively correlated across the two modalities. Our findings show that emotion perception in the visual and auditory modalities may be underpinned by similar and/or shared processes, highlighting that emotion sorting tasks are powerful paradigms for investigating emotion recognition from voices, as well as cross-modal and multimodal emotion recognition.
Affiliation(s)
- Nadine Lavan
- Department of Biological and Experimental Psychology, School of Biological and Behavioural Sciences, Centre for Brain and Behaviour, Queen Mary University of London, London, UK
- Aleena Ahmed
- Department of Biological and Experimental Psychology, School of Biological and Behavioural Sciences, Centre for Brain and Behaviour, Queen Mary University of London, London, UK
- Chantelle Tyrene Oteng
- Department of Biological and Experimental Psychology, School of Biological and Behavioural Sciences, Centre for Brain and Behaviour, Queen Mary University of London, London, UK
- Munira Aden
- Department of Biological and Experimental Psychology, School of Biological and Behavioural Sciences, Centre for Brain and Behaviour, Queen Mary University of London, London, UK
- Luisa Nasciemento-Krüger
- Department of Biological and Experimental Psychology, School of Biological and Behavioural Sciences, Centre for Brain and Behaviour, Queen Mary University of London, London, UK
- Zahra Raffiq
- Department of Biological and Experimental Psychology, School of Biological and Behavioural Sciences, Centre for Brain and Behaviour, Queen Mary University of London, London, UK
- Isabelle Mareschal
- Department of Biological and Experimental Psychology, School of Biological and Behavioural Sciences, Centre for Brain and Behaviour, Queen Mary University of London, London, UK
3
Kirk NW, Cunningham SJ. Listen to yourself! Prioritization of self-associated and own voice cues. Br J Psychol 2025; 116:131-148. [PMID: 39361444] [PMCID: PMC11724686] [DOI: 10.1111/bjop.12741]
Abstract
Self-cues such as one's own name or face attract attention, reflecting a bias for stimuli connected to the self to be prioritized in cognition. Recent evidence suggests that even external voices can elicit this self-prioritization effect; in a voice-label matching task, external voices assigned to the self-identity label 'you' elicited faster responses than those assigned to 'friend' or 'stranger' (Payne et al., Br. J. Psychology, 112, 585-610). However, it is not clear whether external voices assigned to the self are prioritized over participants' own voices. We explore this issue in two experiments. In Experiment 1 (N = 35), a voice-label matching task comprising three external voices confirmed that reaction time and accuracy are improved when an external voice cue is assigned to Self rather than Friend or Stranger. In Experiment 2 (N = 90), one of the voice cues was replaced with a recording of the participant's own voice. Reaction time and accuracy showed a consistent advantage for the participant's own voice, even when it was assigned to the 'friend' or 'stranger' identity. These findings show that external voices can elicit self-prioritization effects if associated with the self, but they are not prioritized above individuals' own voices. This has implications for external voice production technology, suggesting that own-voice imitation may be beneficial.
Collapse
Affiliation(s)
- Neil W. Kirk
- Division of Sociological and Psychological Sciences, Abertay University, Dundee, UK
4
Jiang T, Zhou G. Semantic Content in Face Representation: Essential for Proficient Recognition of Unfamiliar Faces by Good Recognizers. Cogn Sci 2024; 48:e70020. [PMID: 39587972] [DOI: 10.1111/cogs.70020]
Abstract
Face recognition is adapted to achieve the goals of social interaction, which rely on further processing of the semantic information of faces, beyond visual computations. Here, we explored the semantic content of face representation apart from its visual component, and tested their relations to face recognition performance. Specifically, we propose that enhanced visual or semantic coding could underlie the advantage of familiar over unfamiliar face recognition, as well as the superior recognition of skilled face recognizers. We asked participants to freely describe familiar/unfamiliar faces using words or phrases, and converted these descriptions into semantic vectors. Face semantics were transformed into quantifiable face vectors by aggregating these word/phrase vectors. We also extracted visual features from a deep convolutional neural network and obtained the visual representation of familiar/unfamiliar faces. Semantic and visual representations were used to predict perceptual representations generated from a behavioral rating task, separately in different groups (bad/good face recognizers in familiar-face/unfamiliar-face conditions). Comparisons revealed that although long-term memory facilitated visual feature extraction for familiar faces compared to unfamiliar faces, good recognizers compensated for this disparity by incorporating more semantic information for unfamiliar faces, a strategy not observed in bad recognizers. This study highlights the significance of semantics in recognizing unfamiliar faces.
Affiliation(s)
- Tong Jiang
- Department of Psychology, Sun Yat-sen University
- Guomei Zhou
- Department of Psychology, Sun Yat-sen University
5
Liu Y, Li D, Wang W, Jiang Z. What will we judge a book by its cover?-Content analysis of face perception in a Chinese sample. Acta Psychol (Amst) 2024; 251:104631. [PMID: 39622149] [DOI: 10.1016/j.actpsy.2024.104631]
Abstract
People can perceive a wealth of information from faces. Most previous studies of face perception focused on only one attribute, such as gender, expression, or personality, so the whole picture of face perception is far from clear. Therefore, the present study recruited Chinese participants to provide spontaneous descriptions of unfamiliar Chinese faces without any constraints on content. Descriptions employed a broad spectrum of descriptors and showed a consistent pattern across different identities: descriptions that incorporated psychological characteristics were most prevalent, whereas mentions of physiological attributes generally occurred earlier in the description than other types of descriptive vocabulary. These results underscore the special role of free description analysis in revealing the panorama of face perception, in which perceivers swiftly infer a wealth of character traits in an organized way, ultimately forming a comprehensive impression of others.
Affiliation(s)
- Yangtao Liu
- School of Psychology, Liaoning Normal University, Dalian, China
- Dong Li
- School of Psychology, Liaoning Normal University, Dalian, China
- Wenbo Wang
- School of Psychology, Liaoning Normal University, Dalian, China
- Zhongqing Jiang
- School of Psychology, Liaoning Normal University, Dalian, China
6
Lavan N, Rinke P, Scharinger M. The time course of person perception from voices in the brain. Proc Natl Acad Sci U S A 2024; 121:e2318361121. [PMID: 38889147] [PMCID: PMC11214051] [DOI: 10.1073/pnas.2318361121]
Abstract
When listeners hear a voice, they rapidly form a complex first impression of who the person behind that voice might be. We characterize how these multivariate first impressions from voices emerge over time across different levels of abstraction using electroencephalography and representational similarity analysis. We find that for eight perceived physical (gender, age, and health), trait (attractiveness, dominance, and trustworthiness), and social characteristics (educatedness and professionalism), representations emerge early (~80 ms after stimulus onset), with voice acoustics contributing to those representations between ~100 ms and 400 ms. While impressions of person characteristics are highly correlated, we find evidence for highly abstracted, independent representations of individual person characteristics. These abstracted representations emerge gradually over time. That is, representations of physical characteristics (age, gender) arise early (from ~120 ms), while representations of some trait and social characteristics emerge later (~360 ms onward). The findings align with recent theoretical models and shed light on the computations underpinning person perception from voices.
Affiliation(s)
- Nadine Lavan
- Department of Biological and Experimental Psychology, School of Biological and Behavioural Sciences, Queen Mary University of London, London E1 4NS, United Kingdom
- Paula Rinke
- Research Group Phonetics, Institute of German Linguistics, Philipps-University Marburg, Marburg 35037, Germany
- Mathias Scharinger
- Research Group Phonetics, Institute of German Linguistics, Philipps-University Marburg, Marburg 35037, Germany
- Research Center “Deutscher Sprachatlas”, Philipps-University Marburg, Marburg 35037, Germany
- Center for Mind, Brain & Behavior, Universities of Marburg & Gießen, Marburg 35032, Germany
7
Henry M, Bent T, Holt RF. "They sure aren't from around here": Children's perception of accent distance in L1 and L2 varieties of English. J Child Lang 2024:1-24. [PMID: 38646726] [DOI: 10.1017/s0305000924000138]
Abstract
Children exhibit preferences for familiar accents early in life. However, they frequently have more difficulty distinguishing between first language (L1) accents than second language (L2) accents in categorization tasks. Few studies have addressed children's perception of accent strength, or the relation between accent strength and objective measures of pronunciation distance. To address these gaps, 6- and 12-year-olds and adults ranked talkers' perceived distance from the local accent (i.e., Midland American English). Rankings were compared with objective distance measures. Acoustic and phonetic distance measures were significant predictors of ladder rankings, but there was no evidence that children and adults significantly differed in their sensitivity to accent strength. Levenshtein Distance, a phonetic distance metric, was the strongest predictor of perceptual rankings for both children and adults. As a percept, accent strength has critical implications for social judgments, which determine real world social outcomes for talkers with non-local accents.
Affiliation(s)
- Malachi Henry
- Indiana University, Department of Speech, Language and Hearing Sciences, USA
- Tessa Bent
- Indiana University, Department of Speech, Language and Hearing Sciences, USA
- Rachael F Holt
- Ohio State University, Department of Speech and Hearing Science, USA
8
Neuenswander KL, Gillespie GSR, Lick DJ, Bryant GA, Johnson KL. Social evaluative implications of sensory adaptation to human voices. R Soc Open Sci 2024; 11:231348. [PMID: 38544561] [PMCID: PMC10966390] [DOI: 10.1098/rsos.231348]
Abstract
People form social evaluations of others following brief exposure to their voices, and these impressions are calibrated based on recent perceptual experience. Participants adapted to voices with fundamental frequency (f0; the acoustic correlate of perceptual pitch) manipulated to be gender-typical (i.e. masculine men and feminine women) or gender-atypical (i.e. feminine men and masculine women) before evaluating unaltered test voices within the same sex. Adaptation resulted in contrastive aftereffects. Listening to gender-atypical voices caused female voices to sound more feminine and attractive (Study 1) and male voices to sound more masculine and attractive (Study 2). Studies 3a and 3b tested whether adaptation occurred on a conceptual or perceptual level, respectively. In Study 3a, perceivers adapted to gender-typical or gender-atypical voices for both men and women (i.e. adaptors' pitch manipulated in opposite directions for men and women) before evaluating unaltered test voices. Findings showed weak evidence that evaluations differed between conditions. In Study 3b, perceivers adapted to masculinized or feminized voices for both men and women (i.e. adaptors' pitch manipulated in the same direction for men and women) before evaluating unaltered test voices. In the feminized condition, participants rated male targets as more masculine and attractive. Conversely, in the masculinized condition, participants rated female targets as more feminine and attractive. Voices appear to be evaluated according to gender norms that are updated based on perceptual experience as well as conceptual knowledge.
Affiliation(s)
- Gregory A. Bryant
- Department of Communication, University of California, Los Angeles, CA 90095, USA
- Kerri L. Johnson
- Department of Communication, University of California, Los Angeles, CA 90095, USA
- Department of Psychology, University of California, Los Angeles, CA, USA
9
Lavan N, McGettigan C. A model for person perception from familiar and unfamiliar voices. Commun Psychol 2023; 1:1. [PMID: 38665246] [PMCID: PMC11041786] [DOI: 10.1038/s44271-023-00001-4]
Abstract
When hearing a voice, listeners can form a detailed impression of the person behind the voice. Existing models of voice processing focus primarily on one aspect of person perception - identity recognition from familiar voices - but do not account for the perception of other person characteristics (e.g., sex, age, personality traits). Here, we present a broader perspective, proposing that listeners have a common perceptual goal of perceiving who they are hearing, whether the voice is familiar or unfamiliar. We outline and discuss a model - the Person Perception from Voices (PPV) model - that achieves this goal via a common mechanism of recognising a familiar person, persona, or set of speaker characteristics. Our PPV model aims to provide a more comprehensive account of how listeners perceive the person they are listening to, using an approach that incorporates and builds on aspects of the hierarchical frameworks and prototype-based mechanisms proposed within existing models of voice identity recognition.
Affiliation(s)
- Nadine Lavan
- Department of Experimental and Biological Psychology, Queen Mary University of London, London, UK
- Carolyn McGettigan
- Department of Speech, Hearing, and Phonetic Sciences, University College London, London, UK
10
Abstract
Listeners spontaneously form impressions of a person from their voice: Is someone old or young? Trustworthy or untrustworthy? Some studies suggest that these impressions emerge rapidly (e.g., < 400 ms for traits), but it is unclear just how rapidly different impressions can emerge and whether the time courses differ across characteristics. I presented 618 adult listeners with voice recordings ranging from 25 ms to 800 ms in duration and asked them to rate physical (age, sex, health), trait (trustworthiness, dominance, attractiveness), and social (educatedness, poshness, professionalism) characteristics. I then used interrater agreement as an index for impression formation. Impressions of physical characteristics and dominance emerged fastest, showing high agreement after only 25 ms of exposure. In contrast, agreement for trait and social characteristics was initially low to moderate and gradually increased. Such a staggered time course suggests that there could be a temporo-perceptual hierarchy for person perception in which faster impressions could influence later ones.
Affiliation(s)
- Nadine Lavan
- Department of Biological and Experimental Psychology, School of Biological and Behavioural Sciences, Queen Mary University of London