1
Xie X, Jaeger TF, Kurumada C. What we do (not) know about the mechanisms underlying adaptive speech perception: A computational framework and review. Cortex 2023; 166:377-424. [PMID: 37506665] [DOI: 10.1016/j.cortex.2023.05.003]
Abstract
Speech from unfamiliar talkers can be difficult to comprehend initially. These difficulties tend to dissipate with exposure, sometimes within minutes or less. Adaptivity in response to unfamiliar input is now considered a fundamental property of speech perception, and research over the past two decades has made substantial progress in identifying its characteristics. The mechanisms underlying adaptive speech perception, however, remain unknown. Past work has attributed facilitatory effects of exposure to any one of three qualitatively different hypothesized mechanisms: (1) low-level, pre-linguistic signal normalization, (2) changes in/selection of linguistic representations, or (3) changes in post-perceptual decision-making. Direct comparisons of these hypotheses, or combinations thereof, have been lacking. We describe a general computational framework for adaptive speech perception (ASP) that, for the first time, implements all three mechanisms. We demonstrate how the framework can be used to derive predictions for experiments on perception from the acoustic properties of the stimuli. Using this approach, we find that, at the level of data analysis presently employed by most studies in the field, the signature results of influential experimental paradigms do not distinguish between the three mechanisms. This highlights the need for a change in research practices, so that future experiments provide more informative results. We recommend specific changes to experimental paradigms and data analysis. All data and code for this study are shared via OSF, including the R markdown document that this article is generated from, and an R library that implements the models we present.
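As an illustration of why the three mechanisms are hard to tell apart, consider a toy ideal-observer model of categorizing a single acoustic cue (e.g., VOT) as /b/ or /p/. This is a minimal sketch under our own assumptions, not the authors' ASP framework or their R library, and all parameter values are invented; it only shows that normalizing the cue, shifting the category representations, and biasing the decision stage can yield identical categorization responses at test.

```python
import numpy as np
from scipy.stats import norm

# Toy ideal observer: categorize a one-dimensional cue (e.g., VOT in ms) as /b/ or /p/.
# Baseline category parameters are invented for illustration.
MU_B, MU_P, SD = 0.0, 50.0, 15.0   # category means and shared standard deviation
TALKER_SHIFT = 20.0                # hypothetical talker shifts all productions upward

def p_choose_p(cue, mu_b=MU_B, mu_p=MU_P, sd=SD, bias=0.0, normalize=0.0):
    """Probability of responding /p/, with one optional adjustment per mechanism:
    (1) normalize: amount subtracted from the cue before categorization,
    (2) mu_b / mu_p: category representations that learning may shift,
    (3) bias: additive shift of the log-odds at the decision stage.
    """
    x = np.asarray(cue, dtype=float) - normalize
    log_odds = norm.logpdf(x, mu_p, sd) - norm.logpdf(x, mu_b, sd) + bias
    return 1.0 / (1.0 + np.exp(-log_odds))

test_cues = np.array([15.0, 25.0, 35.0, 45.0])  # steps of a /b/-/p/ test continuum

# Each mechanism alone can produce the same boundary shift at test:
via_normalization = p_choose_p(test_cues, normalize=TALKER_SHIFT)
via_representation = p_choose_p(test_cues, mu_b=MU_B + TALKER_SHIFT, mu_p=MU_P + TALKER_SHIFT)
via_decision_bias = p_choose_p(test_cues, bias=-TALKER_SHIFT * (MU_P - MU_B) / SD**2)

print(np.round(via_normalization, 3))
print(np.round(via_representation, 3))
print(np.round(via_decision_bias, 3))
```

In this toy setting the three printed response curves are identical, which mirrors the article's point that standard identification data alone cannot distinguish the mechanisms.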
Affiliation(s)
- Xin Xie
- Language Science, University of California, Irvine, USA.
- T Florian Jaeger
- Brain and Cognitive Sciences, University of Rochester, Rochester, NY, USA; Computer Science, University of Rochester, Rochester, NY, USA
- Chigusa Kurumada
- Brain and Cognitive Sciences, University of Rochester, Rochester, NY, USA
2
Lehet M, Holt LL. Nevertheless, it persists: Dimension-based statistical learning and normalization of speech impact different levels of perceptual processing. Cognition 2020; 202:104328. [PMID: 32502867] [DOI: 10.1016/j.cognition.2020.104328]
Abstract
Speech is notoriously variable, with no simple mapping from acoustics to linguistically meaningful units like words and phonemes. Empirical research on this theoretically central issue establishes at least two classes of perceptual phenomena that accommodate acoustic variability: normalization and perceptual learning. Intriguingly, perceptual learning is supported by learning across acoustic variability, but normalization is thought to counteract acoustic variability, leaving open questions about how these two phenomena might interact. Here, we examine the joint impact of normalization and perceptual learning on how acoustic dimensions map to vowel categories. As listeners categorized nonwords as setch or satch, they experienced a shift in short-term distributional regularities across the vowels' acoustic dimensions. Introduction of this 'artificial accent' resulted in a shift in the contribution of vowel duration to categorization. Although this dimension-based statistical learning impacted the influence of vowel duration on vowel categorization, the duration of these very same vowels nonetheless maintained a consistent influence on categorization of a subsequent consonant via duration contrast, a form of normalization. Thus, vowel duration had a duplex role, consistent with normalization and perceptual learning operating on distinct levels in the processing hierarchy. We posit that whereas normalization operates across auditory dimensions, dimension-based statistical learning impacts the connection weights among auditory dimensions and phonetic categories.
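A minimal sketch of the proposed division of labor (our own toy illustration, not the authors' model; the cue names, weights, and normalization step are assumptions for concreteness): normalization transforms the incoming auditory dimension itself, relative to the surrounding context, whereas dimension-based statistical learning changes how strongly that dimension is weighted when mapped onto a phonetic category.

```python
import numpy as np

def p_satch(spectral, duration, context_duration, w_spectral=2.0, w_duration=1.0):
    """Toy two-cue categorization of 'setch' vs. 'satch' (invented cues and weights).

    Normalization: duration is evaluated relative to the surrounding context
    before categorization (duration contrast).
    Dimension-based statistical learning: the weight on the duration dimension
    can change when duration stops correlating with the category.
    """
    normalized_duration = duration - context_duration          # normalization step
    log_odds = w_spectral * spectral + w_duration * normalized_duration
    return 1.0 / (1.0 + np.exp(-log_odds))                      # P('satch')

# Before the artificial accent: vowel duration carries substantial weight.
before = p_satch(spectral=0.0, duration=0.5, context_duration=0.0)

# After exposure: learning down-weights duration as a category cue, but the
# normalization step still operates on the very same vowel durations.
after = p_satch(spectral=0.0, duration=0.5, context_duration=0.0, w_duration=0.2)

print(round(before, 3), round(after, 3))   # e.g., 0.622 vs. 0.525
```

The same vowel duration thus keeps exerting a context-dependent (normalized) influence even after learning has down-weighted it as a category cue, which is the duplex role described above.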
Affiliation(s)
- Matthew Lehet
- Department of Psychology, Carnegie Mellon University, Pittsburgh, PA 15232, USA; Center for the Neural Basis of Cognition, Pittsburgh, PA 15232, USA
- Lori L Holt
- Department of Psychology, Carnegie Mellon University, Pittsburgh, PA 15232, USA; Center for the Neural Basis of Cognition, Pittsburgh, PA 15232, USA; Neuroscience Institute, Carnegie Mellon University, Pittsburgh, PA 15232, USA.
3
Interleaved lexical and audiovisual information can retune phoneme boundaries. Atten Percept Psychophys 2020; 82:2018-2026. [DOI: 10.3758/s13414-019-01961-8]
4
Liu L, Jaeger TF. Talker-specific pronunciation or speech error? Discounting (or not) atypical pronunciations during speech perception. J Exp Psychol Hum Percept Perform 2019; 45:1562-1588. [PMID: 31750716] [DOI: 10.1037/xhp0000693]
Abstract
Perceptual recalibration allows listeners to adapt to talker-specific pronunciations, such as atypical realizations of specific sounds. Such recalibration can facilitate robust speech recognition. However, indiscriminate recalibration following any atypically pronounced word also risks treating pronunciations as characteristic of a talker when they are in reality due to incidental, short-lived factors (such as a speech error). We investigate whether the mechanisms underlying perceptual recalibration involve inferences about the causes of unexpected pronunciations. In five experiments, we ask whether perceptual recalibration is blocked if the atypical pronunciations of an unfamiliar talker can also be attributed to other incidental causes. We investigated three types of incidental causes for atypical pronunciations: the talker is intoxicated, the talker speaks unusually fast, or the atypical pronunciations occur only in the context of tongue twisters. In all five experiments, we find robust evidence for perceptual recalibration, but little evidence that the presence of incidental causes blocks perceptual recalibration. We discuss these results in light of other recent findings that incidental causes can block perceptual recalibration.
Affiliation(s)
- Linda Liu
- Department of Brain and Cognitive Sciences, University of Rochester
- T Florian Jaeger
- Department of Brain and Cognitive Sciences, University of Rochester
5
Attentional resources contribute to the perceptual learning of talker idiosyncrasies in audiovisual speech. Atten Percept Psychophys 2019; 81:1006-1019. [PMID: 30684204] [DOI: 10.3758/s13414-018-01651-x]
Abstract
To recognize audiovisual speech, listeners evaluate and combine information obtained from the auditory and visual modalities. Listeners also use information from one modality to adjust their phonetic categories to a talker's idiosyncrasy encountered in the other modality. In this study, we examined whether the outcome of this cross-modal recalibration relies on attentional resources. In a standard recalibration experiment in Experiment 1, participants heard an ambiguous sound, disambiguated by the accompanying visual speech as either /p/ or /t/. Participants' primary task was to attend to the audiovisual speech while either monitoring a tone sequence for a target tone or ignoring the tones. Listeners subsequently categorized the steps of an auditory /p/-/t/ continuum more often in line with their exposure. The aftereffect of phonetic recalibration was reduced, but not eliminated, by attentional load during exposure. In Experiment 2, participants saw an ambiguous visual speech gesture that was disambiguated auditorily as either /p/ or /t/. At test, listeners categorized the steps of a visual /p/-/t/ continuum more often in line with the prior exposure. Imposing load in the auditory modality during exposure did not reduce the aftereffect of this type of cross-modal phonetic recalibration. Together, these results suggest that auditory attentional resources are needed for the processing of auditory speech and/or for the shifting of auditory phonetic category boundaries. Listeners thus need to dedicate attentional resources in order to accommodate talker idiosyncrasies in audiovisual speech.
6
Liu L, Jaeger TF. Inferring causes during speech perception. Cognition 2018; 174:55-70. [PMID: 29425987] [PMCID: PMC6553948] [DOI: 10.1016/j.cognition.2018.01.003]
Abstract
One of the central challenges in speech perception is the lack of invariance: talkers differ in how they map words onto the speech signal. Previous work has shown that one mechanism by which listeners overcome this variability is adaptation. However, talkers differ in how they pronounce words for a number of reasons, ranging from more permanent, characteristic factors such as having a foreign accent, to more temporary, incidental factors, such as speaking with a pen in the mouth. One challenge for listeners is that the true cause underlying atypical pronunciations is never directly known, and instead must be inferred from (often causally ambiguous) evidence. In three experiments, we investigate whether these inferences underlie speech perception, and how the speech perception system deals with uncertainty about competing causes for atypical pronunciations. We find that adaptation to atypical pronunciations is affected by whether the atypical pronunciations are seen as characteristic or incidental. Furthermore, we find that listeners are able to maintain information about previous causally ambiguous pronunciations that they experience, and use this previously experienced evidence to drive their adaptation after additional evidence has disambiguated the cause. Our findings revise previous proposals that causally ambiguous evidence is ignored during speech adaptation.
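A toy sketch of the kind of causal attribution described above (our own illustration with made-up likelihood values, not the authors' model): the listener accumulates evidence about whether atypical pronunciations reflect a stable, talker-characteristic cause or an incidental one, and adaptation is scaled by the posterior probability of the characteristic cause. Because evidence from earlier, ambiguous tokens is retained, it can be re-evaluated once later tokens disambiguate the cause.

```python
import numpy as np

def p_characteristic(loglik_characteristic, loglik_incidental, prior=0.5):
    """Posterior probability that atypical pronunciations reflect a stable,
    talker-characteristic cause rather than an incidental one (toy values)."""
    log_odds = (np.log(prior) - np.log(1.0 - prior)
                + np.sum(loglik_characteristic) - np.sum(loglik_incidental))
    return 1.0 / (1.0 + np.exp(-log_odds))

def adapt(boundary, observed_shift, p_char, rate=0.5):
    """Shift a category boundary only to the extent the atypical evidence is
    attributed to the talker (adaptation gated by causal attribution)."""
    return boundary + rate * p_char * observed_shift

# Causally ambiguous exposure: both causes explain the tokens about equally well.
ambiguous = p_characteristic(loglik_characteristic=[-1.0, -1.1],
                             loglik_incidental=[-1.05, -1.0])

# Later, disambiguating evidence (e.g., the atypical tokens persist outside the
# tongue twisters) is pooled with the retained earlier evidence and re-evaluated.
disambiguated = p_characteristic(loglik_characteristic=[-1.0, -1.1, -0.5],
                                 loglik_incidental=[-1.05, -1.0, -3.0])

print(adapt(boundary=0.0, observed_shift=10.0, p_char=ambiguous))      # small shift
print(adapt(boundary=0.0, observed_shift=10.0, p_char=disambiguated))  # larger shift
```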
Affiliation(s)
- Linda Liu
- Department of Brain and Cognitive Sciences, University of Rochester, USA.
- T Florian Jaeger
- Department of Brain and Cognitive Sciences, University of Rochester, USA; Department of Computer Science, University of Rochester, USA.
7
Noppeney U, Lee HL. Causal inference and temporal predictions in audiovisual perception of speech and music. Ann N Y Acad Sci 2018; 1423:102-116. [PMID: 29604082] [DOI: 10.1111/nyas.13615]
Abstract
To form a coherent percept of the environment, the brain must integrate sensory signals emanating from a common source but segregate those from different sources. Temporal regularities are prominent cues for multisensory integration, particularly for speech and music perception. In line with models of predictive coding, we suggest that the brain adapts an internal model to the statistical regularities in its environment. This internal model enables cross-sensory and sensorimotor temporal predictions as a mechanism to arbitrate between integration and segregation of signals from different senses.
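The integrate-versus-segregate arbitration described here is often formalized as Bayesian causal inference over a common-cause variable (e.g., Körding et al., 2007). The sketch below is a generic version of that computation, not the specific model discussed in this article; the sensory noise, prior width, and prior probability of a common cause are placeholder values.

```python
import numpy as np

def causal_inference(x_a, x_v, sd_a=1.0, sd_v=2.0, sd_prior=10.0, p_common=0.5):
    """Generic Bayesian causal inference for an auditory and a visual measurement.

    Returns the posterior probability that the two signals share a common cause
    and a model-averaged estimate of the audiovisual event. Noise and prior
    parameters are placeholders; the prior over the source is zero-mean Gaussian.
    """
    var_a, var_v, var_p = sd_a**2, sd_v**2, sd_prior**2

    # Likelihood of both measurements under a common cause (C = 1).
    var_common = var_a * var_v + var_a * var_p + var_v * var_p
    like_c1 = (np.exp(-0.5 * ((x_a - x_v)**2 * var_p + x_a**2 * var_v + x_v**2 * var_a)
                      / var_common)
               / (2.0 * np.pi * np.sqrt(var_common)))

    # Likelihood under independent causes (C = 2): each signal has its own source.
    like_c2 = (np.exp(-0.5 * x_a**2 / (var_a + var_p)) / np.sqrt(2.0 * np.pi * (var_a + var_p))
               * np.exp(-0.5 * x_v**2 / (var_v + var_p)) / np.sqrt(2.0 * np.pi * (var_v + var_p)))

    post_c1 = like_c1 * p_common / (like_c1 * p_common + like_c2 * (1.0 - p_common))

    # Estimates under each causal structure: fused (reliability-weighted) vs. auditory-only.
    fused = (x_a / var_a + x_v / var_v) / (1.0 / var_a + 1.0 / var_v + 1.0 / var_p)
    audio_only = x_a * var_p / (var_a + var_p)

    # Model averaging over the two causal structures.
    return post_c1, post_c1 * fused + (1.0 - post_c1) * audio_only

print(causal_inference(x_a=0.5, x_v=1.0))   # signals close together: likely integrated
print(causal_inference(x_a=0.5, x_v=8.0))   # signals far apart: largely segregated
```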
Affiliation(s)
- Uta Noppeney
- Computational Neuroscience and Cognitive Robotics Centre, University of Birmingham, Birmingham, UK
- Hwee Ling Lee
- German Center for Neurodegenerative Diseases (DZNE), Bonn, Germany
8
Abstract
Listeners adjust their phonetic categories to cope with variations in the speech signal (phonetic recalibration). Previous studies have shown that lipread speech (and word knowledge) can adjust the perception of ambiguous speech and can induce phonetic adjustments (Bertelson, Vroomen, & de Gelder in Psychological Science, 14(6), 592–597, 2003; Norris, McQueen, & Cutler in Cognitive Psychology, 47(2), 204–238, 2003). We examined whether orthographic information (text) can also induce phonetic recalibration. Experiment 1 showed that, after exposure to ambiguous speech sounds halfway between /b/ and /d/ that were combined with text (b or d), participants were more likely to categorize auditory-only test sounds in accordance with the exposed letters. Experiment 2 replicated this effect with a very short exposure phase. These results show that listeners adjust their phonetic boundaries in accordance with disambiguating orthographic information and that these adjustments show a rapid build-up.
9
10
Identifying and quantifying multisensory integration: a tutorial review. Brain Topogr 2014; 27:707-730. [PMID: 24722880] [DOI: 10.1007/s10548-014-0365-7]
Abstract
We process information from the world through multiple senses, and the brain must decide what information belongs together and what information should be segregated. One challenge in studying such multisensory integration is how to quantify the multisensory interactions, a challenge that is amplified by the host of methods now used to measure neural, behavioral, and perceptual responses. Many of the measures that have been developed to quantify multisensory integration (and which have been derived from single-unit analyses) have been applied to these different measures without much consideration for the nature of the process being studied. Here, we provide a review focused on the means with which experimenters quantify multisensory processes and integration across a range of commonly used experimental methodologies. We emphasize the most commonly employed measures, including single- and multiunit responses, local field potentials, functional magnetic resonance imaging, and electroencephalography, along with behavioral measures of detection, accuracy, and response times. In each section, we discuss the different metrics commonly used to quantify multisensory interactions, including the rationale for their use, their advantages, and the drawbacks and caveats associated with them. Also discussed are possible alternatives to the most commonly used metrics.
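Several of the metrics this kind of review covers can be stated compactly. The sketch below illustrates three generic ones with made-up numbers (it is not code from the review): multisensory enhancement relative to the best unisensory response, additivity relative to the summed unisensory responses, and Miller's race-model inequality for response times.

```python
import numpy as np

def enhancement_index(multi, best_unisensory):
    """Percent multisensory enhancement over the best unisensory response,
    the classic single-unit metric: 100 * (CM - SMmax) / SMmax."""
    return 100.0 * (multi - best_unisensory) / best_unisensory

def additivity_index(multi, uni_a, uni_b):
    """Multisensory response as a percentage of the summed unisensory responses;
    values above 100 indicate superadditivity, below 100 subadditivity."""
    return 100.0 * multi / (uni_a + uni_b)

def race_model_violation(rt_av, rt_a, rt_v, t):
    """Miller's race-model inequality for response times at time t:
    under a race, P(RT_AV <= t) <= P(RT_A <= t) + P(RT_V <= t).
    Returns how far the observed audiovisual CDF exceeds that bound."""
    def cdf(rts):
        return float(np.mean(np.asarray(rts) <= t))
    bound = min(1.0, cdf(rt_a) + cdf(rt_v))
    return cdf(rt_av) - bound

# Made-up values (spikes/s and ms), purely for illustration.
print(enhancement_index(multi=30.0, best_unisensory=20.0))      # 50.0
print(additivity_index(multi=30.0, uni_a=20.0, uni_b=15.0))     # ~85.7 (subadditive)
print(race_model_violation(rt_av=[280, 300, 310, 330], rt_a=[340, 360, 380, 400],
                           rt_v=[350, 370, 390, 410], t=320))   # 0.75 (violation)
```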
11
12
13
Adaptation to different mouth shapes influences visual perception of ambiguous lip speech. Psychon Bull Rev 2010; 17:522-528. [PMID: 20702872] [DOI: 10.3758/pbr.17.4.522]
Abstract
We investigated the effects of adaptation to mouth shapes associated with different spoken sounds (sustained /m/ or /u/) on visual perception of lip speech. Participants were significantly more likely to label ambiguous faces on an /m/-to-/u/ continuum as saying /u/ following adaptation to /m/ mouth shapes than they were in a preadaptation test. By contrast, participants were significantly less likely to label the ambiguous faces as saying /u/ following adaptation to /u/ mouth shapes than they were in a preadaptation test. The magnitude of these aftereffects was equivalent when the same individual was shown in the adaptation and test phases of the experiment and when different individuals were presented in the adaptation and test phases. These findings present novel evidence that adaptation to natural variations in facial appearance influences face perception, and they extend previous research on face aftereffects to visual perception of lip speech.
14
Baart M, Vroomen J. Phonetic recalibration does not depend on working memory. Exp Brain Res 2010; 203:575-582. [PMID: 20437168] [PMCID: PMC2875474] [DOI: 10.1007/s00221-010-2264-9]
Abstract
Listeners use lipread information to adjust the phonetic boundary between two speech categories (phonetic recalibration; Bertelson et al. 2003). Here, we examined phonetic recalibration while listeners were engaged in a verbal or visuospatial working memory task under different memory load conditions. Phonetic recalibration was, like selective speech adaptation, not affected by a concurrent verbal or visuospatial memory task. This result indicates that phonetic recalibration is a low-level process that does not critically depend on the processes used in verbal or visuospatial working memory.
Affiliation(s)
- Martijn Baart
- Department of Medical Psychology and Neuropsychology, Tilburg University, Warandelaan 2, P. O. Box 90153, 5000 LE Tilburg, The Netherlands
- Jean Vroomen
- Department of Medical Psychology and Neuropsychology, Tilburg University, Warandelaan 2, P. O. Box 90153, 5000 LE Tilburg, The Netherlands
15
Baart M, Vroomen J. Do you see what you are hearing? Cross-modal effects of speech sounds on lipreading. Neurosci Lett 2010; 471:100-103. [PMID: 20080146] [DOI: 10.1016/j.neulet.2010.01.019]
Abstract
It is well known that visual information derived from mouth movements (i.e., lipreading) can have profound effects on auditory speech identification (e.g., the McGurk effect). Here we examined the reverse phenomenon, namely whether auditory speech affects lipreading. We report that speech sounds dubbed onto lipread speech affect immediate identification of lipread tokens. This effect likely reflects genuine cross-modal integration of sensory signals, and not just a simple response bias, because we also observed adaptive shifts in visual identification of the ambiguous lipread tokens after exposure to incongruent audiovisual adapter stimuli. Presumably, listeners had learned to label the lipread stimulus in accordance with the sound, thus demonstrating that the interaction between hearing and lipreading is genuinely bi-directional.
Affiliation(s)
- Martijn Baart
- Department of Medical Psychology and Neuropsychology, Tilburg University, P.O. Box 90153, Warandelaan 2, 5000 LE Tilburg, The Netherlands
16
Abstract
Adult language users have an enormous amount of experience with speech in their native language. As a result, they have very well-developed processes for categorizing the sounds of speech that they hear. Despite this very high level of experience, recent research has shown that listeners are capable of redeveloping their speech categorization to bring it into alignment with new variation in their speech input. This reorganization of phonetic space is a type of perceptual learning, or recalibration, of speech processes. In this article, we review several recent lines of research on perceptual learning for speech.