1. Do TD, Protko CI, McMahan RP. Stepping into the Right Shoes: The Effects of User-Matched Avatar Ethnicity and Gender on Sense of Embodiment in Virtual Reality. IEEE Trans Vis Comput Graph 2024;30:2434-2443. [PMID: 38437125; DOI: 10.1109/tvcg.2024.3372067]
Abstract
In many consumer virtual reality (VR) applications, users embody predefined characters that offer minimal customization options, frequently emphasizing storytelling over user choice. We explore whether matching a user's physical characteristics, specifically ethnicity and gender, with their virtual self-avatar affects their sense of embodiment in VR. We conducted a 2 × 2 within-subjects experiment (n = 32) with a diverse user population to explore the impact of matching or not matching a user's self-avatar to their ethnicity and gender on their sense of embodiment. Our results indicate that matching the ethnicity of the user and their self-avatar significantly enhances sense of embodiment regardless of gender, extending across various aspects, including appearance, response, and ownership. We also found that matching gender significantly enhanced ownership, suggesting that this aspect is influenced by matching both ethnicity and gender. Interestingly, we found that matching ethnicity specifically affects self-location, while matching gender specifically affects one's body ownership.
2. Tiippana K, Ujiie Y, Peromaa T, Takahashi K. Investigation of Cross-Language and Stimulus-Dependent Effects on the McGurk Effect with Finnish and Japanese Speakers and Listeners. Brain Sci 2023;13:1198. [PMID: 37626554; PMCID: PMC10452414; DOI: 10.3390/brainsci13081198]
Abstract
In the McGurk effect, perception of a spoken consonant is altered when an auditory (A) syllable is presented with an incongruent visual (V) syllable (e.g., A/pa/V/ka/ is often heard as /ka/ or /ta/). The McGurk effect provides a measure of visual influence on speech perception, the effect being stronger the lower the proportion of correct auditory responses. Cross-language effects are studied to understand processing differences between one's own and foreign languages. The McGurk effect has sometimes been found to be stronger with foreign speakers, but other studies have shown the opposite, or no difference between languages. Most studies have compared English with other languages. We investigated cross-language effects with native Finnish and Japanese speakers and listeners. Each listener group comprised 49 participants. The stimuli (/ka/, /pa/, /ta/) were uttered by two female and male Finnish and Japanese speakers and presented in A, V, and AV modalities, including a McGurk stimulus A/pa/V/ka/. The McGurk effect was stronger with Japanese stimuli in both listener groups. Differences in speech perception were prominent between individual speakers but less so between native languages. Unisensory perception correlated with McGurk perception. These findings suggest that stimulus-dependent features contribute to the McGurk effect, and that they may influence syllable perception more strongly than cross-language factors.
Affiliation(s)
- Kaisa Tiippana
- Department of Psychology and Logopedics, University of Helsinki, 00014 Helsinki, Finland
- Yuta Ujiie
- Department of Psychology, College of Contemporary Psychology, Rikkyo University, Saitama 352-8558, Japan
- Research Organization of Open Innovation and Collaboration, Ritsumeikan University, Osaka 567-8570, Japan
- Tarja Peromaa
- Department of Psychology and Logopedics, University of Helsinki, 00014 Helsinki, Finland
- Kohske Takahashi
- College of Comprehensive Psychology, Ritsumeikan University, Osaka 567-8570, Japan
3. Van Engen KJ, Dey A, Sommers MS, Peelle JE. Audiovisual speech perception: Moving beyond McGurk. J Acoust Soc Am 2022;152:3216. [PMID: 36586857; PMCID: PMC9894660; DOI: 10.1121/10.0015262]
Abstract
Although it is clear that sighted listeners use both auditory and visual cues during speech perception, the manner in which multisensory information is combined is a matter of debate. One approach to measuring multisensory integration is to use variants of the McGurk illusion, in which discrepant auditory and visual cues produce auditory percepts that differ from those based on unimodal input. Not all listeners show the same degree of susceptibility to the McGurk illusion, and these individual differences are frequently used as a measure of audiovisual integration ability. However, despite their popularity, we join the voices of others in the field to argue that McGurk tasks are ill-suited for studying real-life multisensory speech perception: McGurk stimuli are often based on isolated syllables (which are rare in conversations) and necessarily rely on audiovisual incongruence that does not occur naturally. Furthermore, recent data show that susceptibility to McGurk tasks does not correlate with performance during natural audiovisual speech perception. Although the McGurk effect is a fascinating illusion, truly understanding the combined use of auditory and visual information during speech perception requires tasks that more closely resemble everyday communication: namely, words, sentences, and narratives with congruent auditory and visual speech cues.
Collapse
Affiliation(s)
- Kristin J Van Engen
- Department of Psychological and Brain Sciences, Washington University, St. Louis, Missouri 63130, USA
- Avanti Dey
- PLOS ONE, 1265 Battery Street, San Francisco, California 94111, USA
- Mitchell S Sommers
- Department of Psychological and Brain Sciences, Washington University, St. Louis, Missouri 63130, USA
- Jonathan E Peelle
- Department of Otolaryngology, Washington University, St. Louis, Missouri 63130, USA
4. Gijbels L, Yeatman JD, Lalonde K, Lee AKC. Audiovisual Speech Processing in Relationship to Phonological and Vocabulary Skills in First Graders. J Speech Lang Hear Res 2021;64:5022-5040. [PMID: 34735292; PMCID: PMC9150669; DOI: 10.1044/2021_jslhr-21-00196]
Abstract
PURPOSE: It is generally accepted that adults use visual cues to improve speech intelligibility in noisy environments, but findings regarding visual speech benefit in children are mixed. We explored factors that contribute to audiovisual (AV) gain in young children's speech understanding. We examined whether first graders show an AV benefit to speech-in-noise recognition and whether the visual salience of phonemes influences that benefit. We also explored whether individual differences in AV speech enhancement could be explained by vocabulary knowledge, phonological awareness, or general psychophysical testing performance. METHOD: Thirty-seven first graders completed online psychophysical experiments. We used an online single-interval, four-alternative forced-choice picture-pointing task with age-appropriate consonant-vowel-consonant words to measure auditory-only, visual-only, and AV word recognition in noise at -2 and -8 dB SNR. We obtained standard measures of vocabulary and phonological awareness and included a general psychophysical test to examine correlations with AV benefits. RESULTS: We observed a significant overall AV gain among first graders. This effect was mainly attributable to the benefit at -8 dB SNR for visually distinct targets. Individual differences were not explained by any of the child variables. Boys showed lower auditory-only performance, leading to significantly larger AV gains. CONCLUSIONS: This study shows an AV benefit of distinctive visual cues to word recognition under challenging noisy conditions in first graders. The cognitive and linguistic constraints of the task may have minimized the impact of individual differences in vocabulary and phonological awareness on AV benefit. The gender difference should be studied in a larger sample and age range.
Collapse
Affiliation(s)
- Liesbeth Gijbels
- Department of Speech & Hearing Sciences, University of Washington, Seattle
- Institute for Learning & Brain Sciences, University of Washington, Seattle
- Jason D. Yeatman
- Division of Developmental-Behavioral Pediatrics, School of Medicine, Stanford University, CA
- Graduate School of Education, Stanford University, CA
- Kaylah Lalonde
- Boys Town National Research Hospital, Center for Hearing Research, Omaha, NE
- Adrian K. C. Lee
- Department of Speech & Hearing Sciences, University of Washington, Seattle
- Institute for Learning & Brain Sciences, University of Washington, Seattle
5. Visual Influences on Auditory Behavioral, Neural, and Perceptual Processes: A Review. J Assoc Res Otolaryngol 2021;22:365-386. [PMID: 34014416; PMCID: PMC8329114; DOI: 10.1007/s10162-021-00789-0]
Abstract
In a naturalistic environment, auditory cues are often accompanied by information from other senses, which can be redundant with or complementary to the auditory information. Although the multisensory interactions that arise from combining this information and that shape auditory function are seen across all sensory modalities, our greatest body of knowledge to date centers on how vision influences audition. In this review, we attempt to capture the current state of our understanding of this topic. Following a general introduction, the review is divided into five sections. The first section reviews the psychophysical evidence in humans regarding vision's influence on audition, making the distinction between vision's ability to enhance versus alter auditory performance and perception. Three examples are then described that highlight vision's ability to modulate auditory processes: spatial ventriloquism, cross-modal dynamic capture, and the McGurk effect. The final part of this section discusses models that have been built on the available psychophysical data and that seek to provide greater mechanistic insight into how vision can impact audition. The second section reviews the extant neuroimaging and far-field imaging work on this topic, with a strong emphasis on the roles of feedforward and feedback processes, on imaging insights into the causal nature of audiovisual interactions, and on the limitations of current imaging-based approaches. These limitations point to a greater need for machine-learning-based decoding approaches toward understanding how auditory representations are shaped by vision. The third section reviews the wealth of neuroanatomical and neurophysiological data from animal models that highlights audiovisual interactions at the neuronal and circuit level in both subcortical and cortical structures. It also speaks to the functional significance of audiovisual interactions for two critically important facets of auditory perception: scene analysis and communication. The fourth section presents current evidence for alterations in audiovisual processes in three clinical conditions: autism, schizophrenia, and sensorineural hearing loss. These changes in audiovisual interactions are postulated to have cascading effects on higher-order domains of dysfunction in these conditions. The final section highlights ongoing work seeking to leverage our knowledge of audiovisual interactions to develop better remediation approaches to these sensory-based disorders, founded in concepts of perceptual plasticity in which vision has been shown to have the capacity to facilitate auditory learning.
6. Ujiie Y, Takahashi K. Weaker McGurk Effect for Rubin's Vase-Type Speech in People With High Autistic Traits. Multisens Res 2021;34:1-17. [PMID: 33873157; DOI: 10.1163/22134808-bja10047]
Abstract
While visual information from facial speech modulates auditory speech perception, it is less influential on audiovisual speech perception among autistic individuals than among typically developed individuals. In this study, we investigated the relationship between autistic traits (Autism-Spectrum Quotient; AQ) and the influence of visual speech on the recognition of Rubin's vase-type speech stimuli with degraded facial speech information. Participants were 31 university students (13 males and 18 females; mean age: 19.2 years, SD: 1.13) who reported normal (or corrected-to-normal) hearing and vision. All participants completed three speech recognition tasks (visual, auditory, and audiovisual stimuli) and the Japanese version of the AQ. The results showed that speech recognition accuracies for visual (i.e., lip-reading) and auditory stimuli were not significantly related to participants' AQ. In contrast, audiovisual speech perception was less influenced by facial speech among individuals with high rather than low autistic traits. This weaker influence of visual information on audiovisual speech perception in autism spectrum disorder (ASD) was robust regardless of the clarity of the visual information, suggesting a difficulty in the process of audiovisual integration rather than in the visual processing of facial speech.
Collapse
Affiliation(s)
- Yuta Ujiie
- Graduate School of Psychology, Chukyo University, 101-2 Yagoto Honmachi, Showa-ku, Nagoya-shi, Aichi, 466-8666, Japan
- Japan Society for the Promotion of Science, Kojimachi Business Center Building, 5-3-1 Kojimachi, Chiyoda-ku, Tokyo 102-0083, Japan
- Research and Development Initiative, Chuo University, 1-13-27, Kasuga, Bunkyo-ku, Tokyo, 112-8551, Japan
- Kohske Takahashi
- School of Psychology, Chukyo University, 101-2 Yagoto Honmachi, Showa-ku, Nagoya-shi, Aichi, 466-8666, Japan
7.
Abstract
A speech signal carries information about meaning and about the talker conveying that meaning. It is now known that these two dimensions are related. There is evidence that gaining experience with a particular talker in one modality not only facilitates better phonetic perception in that modality, but also transfers across modalities to allow better phonetic perception in the other. This finding suggests that experience with a talker provides familiarity with some amodal properties of their articulation such that the experience can be shared across modalities. The present study investigates if experience with talker-specific articulatory information can also support cross-modal talker learning. In Experiment 1 we show that participants can learn to identify ten novel talkers from point-light and sinewave speech, expanding on prior work. Point-light and sinewave speech also supported similar talker identification accuracies, and similar patterns of talker confusions were found across stimulus types. Experiment 2 showed these stimuli could also support cross-modal talker matching, further expanding on prior work. Finally, in Experiment 3 we show that learning to identify talkers in one modality (visual-only point-light speech) facilitates learning of those same talkers in another modality (auditory-only sinewave speech). These results suggest that some of the information for talker identity takes a modality-independent form.
8. Thézé R, Gadiri MA, Albert L, Provost A, Giraud AL, Mégevand P. Animated virtual characters to explore audio-visual speech in controlled and naturalistic environments. Sci Rep 2020;10:15540. [PMID: 32968127; PMCID: PMC7511320; DOI: 10.1038/s41598-020-72375-y]
Abstract
Natural speech is processed in the brain as a mixture of auditory and visual features. An example of the importance of visual speech is the McGurk effect and related perceptual illusions that result from mismatching auditory and visual syllables. Although the McGurk effect has been widely applied to the exploration of audio-visual speech processing, it relies on isolated syllables, which severely limits the conclusions that can be drawn from the paradigm. In addition, the extreme variability and quality of the stimuli usually employed prevent comparability across studies. To overcome these limitations, we present an innovative methodology using 3D virtual characters with realistic lip movements synchronized to computer-synthesized speech. We used commercially accessible and affordable tools to facilitate reproducibility and comparability, and the set-up was validated on 24 participants performing a perception task. Within complete and meaningful French sentences, we paired a labiodental fricative viseme (i.e., /v/) with a bilabial occlusive phoneme (i.e., /b/). This audiovisual mismatch is known to induce the illusion of hearing /v/ in a proportion of trials. We tested the rate of the illusion while varying the magnitude of background noise and audiovisual lag. Overall, the effect was observed in 40% of trials. The proportion rose to about 50% with added background noise and up to 66% when controlling for phonetic features. Our results demonstrate that computer-generated speech stimuli are a judicious choice, and that they can supplement natural speech with higher control over stimulus timing and content.
Collapse
Affiliation(s)
- Raphaël Thézé
- Department of Basic Neurosciences, University of Geneva, Campus Biotech, Chemin des Mines 9, 1202, Geneva, Switzerland
- Mehdi Ali Gadiri
- Department of Basic Neurosciences, University of Geneva, Campus Biotech, Chemin des Mines 9, 1202, Geneva, Switzerland
- Louis Albert
- Human Neuroscience Platform, Fondation Campus Biotech Geneva, Geneva, Switzerland
- Antoine Provost
- Human Neuroscience Platform, Fondation Campus Biotech Geneva, Geneva, Switzerland
- Anne-Lise Giraud
- Department of Basic Neurosciences, University of Geneva, Campus Biotech, Chemin des Mines 9, 1202, Geneva, Switzerland
- Pierre Mégevand
- Department of Basic Neurosciences, University of Geneva, Campus Biotech, Chemin des Mines 9, 1202 Geneva, Switzerland
- Division of Neurology, Geneva University Hospitals, Geneva, Switzerland
9. Maffei V, Indovina I, Mazzarella E, Giusti MA, Macaluso E, Lacquaniti F, Viviani P. Sensitivity of occipito-temporal cortex, premotor and Broca's areas to visible speech gestures in a familiar language. PLoS One 2020;15:e0234695. [PMID: 32559213; PMCID: PMC7304574; DOI: 10.1371/journal.pone.0234695]
Abstract
When looking at a speaking person, the analysis of facial kinematics contributes to language discrimination and to the decoding of the time flow of visual speech. To disentangle these two factors, we investigated behavioural and fMRI responses to familiar and unfamiliar languages when observing speech gestures with natural or reversed kinematics. Twenty Italian volunteers viewed silent video-clips of speech shown as recorded (Forward, biological motion) or reversed in time (Backward, non-biological motion), in Italian (familiar language) or Arabic (unfamiliar language). fMRI revealed that language (Italian/Arabic) and time-rendering (Forward/Backward) modulated distinct areas in the ventral occipito-temporal cortex, suggesting that visual speech analysis begins in this region, earlier than previously thought. Left premotor ventral (superior subdivision) and dorsal areas were preferentially activated by the familiar language independently of time-rendering, challenging the view that the role of these regions in speech processing is purely articulatory. The left premotor ventral region in the frontal operculum, thought to include part of Broca's area, responded to the natural familiar language, consistent with the hypothesis of motor simulation of speech gestures.
Collapse
Affiliation(s)
- Vincenzo Maffei
- Laboratory of Neuromotor Physiology, IRCCS Santa Lucia Foundation, Rome, Italy
- Centre of Space BioMedicine and Department of Systems Medicine, University of Rome Tor Vergata, Rome, Italy
- Data Lake & BI, DOT - Technology, Poste Italiane, Rome, Italy
- Iole Indovina
- Laboratory of Neuromotor Physiology, IRCCS Santa Lucia Foundation, Rome, Italy
- Departmental Faculty of Medicine and Surgery, Saint Camillus International University of Health and Medical Sciences, Rome, Italy
- Maria Assunta Giusti
- Centre of Space BioMedicine and Department of Systems Medicine, University of Rome Tor Vergata, Rome, Italy
- Emiliano Macaluso
- ImpAct Team, Lyon Neuroscience Research Center, Lyon, France
- Laboratory of Neuroimaging, IRCCS Santa Lucia Foundation, Rome, Italy
- Francesco Lacquaniti
- Laboratory of Neuromotor Physiology, IRCCS Santa Lucia Foundation, Rome, Italy
- Centre of Space BioMedicine and Department of Systems Medicine, University of Rome Tor Vergata, Rome, Italy
- Paolo Viviani
- Laboratory of Neuromotor Physiology, IRCCS Santa Lucia Foundation, Rome, Italy
- Centre of Space BioMedicine and Department of Systems Medicine, University of Rome Tor Vergata, Rome, Italy
10. Borowiak K, Maguinness C, von Kriegstein K. Dorsal-movement and ventral-form regions are functionally connected during visual-speech recognition. Hum Brain Mapp 2020;41:952-972. [PMID: 31749219; PMCID: PMC7267922; DOI: 10.1002/hbm.24852]
Abstract
Faces convey social information such as emotion and speech. Facial emotion processing is supported via interactions between dorsal-movement and ventral-form visual cortex regions. Here, we explored, for the first time, whether similar dorsal-ventral interactions (assessed via functional connectivity) might also exist for visual-speech processing. We then examined whether altered dorsal-ventral connectivity is observed in adults with high-functioning autism spectrum disorder (ASD), a disorder associated with impaired visual-speech recognition. We acquired functional magnetic resonance imaging (fMRI) data with concurrent eye tracking in pairwise-matched control and ASD participants. In both groups, dorsal-movement regions in the visual motion area 5 (V5/MT) and the temporal visual speech area (TVSA) were functionally connected to ventral-form regions (i.e., the occipital face area [OFA] and the fusiform face area [FFA]) during the recognition of visual speech, in contrast to the recognition of face identity. Notably, parts of this functional connectivity were decreased in the ASD group compared to the controls (i.e., right V5/MT-right OFA, left TVSA-left FFA). The results confirmed our hypothesis that functional connectivity between dorsal-movement and ventral-form regions exists during visual-speech processing. Its partial dysfunction in ASD might contribute to difficulties in the recognition of dynamic face information relevant for successful face-to-face communication.
Collapse
Affiliation(s)
- Kamila Borowiak
- Chair of Cognitive and Clinical Neuroscience, Faculty of Psychology, Technische Universität Dresden, Dresden, Germany
- Max Planck Institute for Human Cognitive and Brain Sciences, Leipzig, Germany
- Berlin School of Mind and Brain, Humboldt University of Berlin, Berlin, Germany
- Corrina Maguinness
- Chair of Cognitive and Clinical Neuroscience, Faculty of Psychology, Technische Universität Dresden, Dresden, Germany
- Max Planck Institute for Human Cognitive and Brain Sciences, Leipzig, Germany
- Katharina von Kriegstein
- Chair of Cognitive and Clinical Neuroscience, Faculty of Psychology, Technische Universität Dresden, Dresden, Germany
- Max Planck Institute for Human Cognitive and Brain Sciences, Leipzig, Germany
11. Stawicki M, Majdak P, Başkent D. Ventriloquist Illusion Produced With Virtual Acoustic Spatial Cues and Asynchronous Audiovisual Stimuli in Both Young and Older Individuals. Multisens Res 2019;32:745-770. [DOI: 10.1163/22134808-20191430]
Abstract
The ventriloquist illusion, the change in the perceived location of an auditory stimulus when a synchronously presented but spatially discordant visual stimulus is added, has previously been shown in young healthy populations to be a robust paradigm that mainly relies on automatic processes. Here, we propose the ventriloquist illusion as a potential simple test to assess audiovisual (AV) integration in young and older individuals. We used a modified version of the illusion paradigm that was adaptive, nearly bias-free, relied on binaural stimulus representation using generic head-related transfer functions (HRTFs) instead of multiple loudspeakers, and was tested with synchronous and asynchronous presentation of AV stimuli (both tone and speech). The minimum audible angle (MAA), the smallest perceptible difference in angle between two sound sources, was compared with or without the visual stimuli in young and older adults with no or minimal sensory deficits. The illusion effect, measured by means of MAAs implemented with HRTFs, was observed with both synchronous and asynchronous visual stimuli, but only with tone and not speech stimuli. The patterns were similar between young and older individuals, indicating the versatility of the modified ventriloquist illusion paradigm.
Collapse
Affiliation(s)
- Marnix Stawicki
- Department of Otorhinolaryngology / Head and Neck Surgery, University Medical Center Groningen, University of Groningen, Groningen, The Netherlands
- Graduate School of Medical Sciences, Research School of Behavioral and Cognitive Neurosciences (BCN), University of Groningen, Groningen, The Netherlands
- Piotr Majdak
- Acoustics Research Institute, Austrian Academy of Sciences, Vienna, Austria
- Deniz Başkent
- Department of Otorhinolaryngology / Head and Neck Surgery, University Medical Center Groningen, University of Groningen, Groningen, The Netherlands
- Graduate School of Medical Sciences, Research School of Behavioral and Cognitive Neurosciences (BCN), University of Groningen, Groningen, The Netherlands
12. Attentional resources contribute to the perceptual learning of talker idiosyncrasies in audiovisual speech. Atten Percept Psychophys 2019;81:1006-1019. [PMID: 30684204; DOI: 10.3758/s13414-018-01651-x]
Abstract
To recognize audiovisual speech, listeners evaluate and combine information obtained from the auditory and visual modalities. Listeners also use information from one modality to adjust their phonetic categories to a talker's idiosyncrasy encountered in the other modality. In this study, we examined whether the outcome of this cross-modal recalibration relies on attentional resources. In Experiment 1, a standard recalibration experiment, participants heard an ambiguous sound, disambiguated by the accompanying visual speech as either /p/ or /t/. Participants' primary task was to attend to the audiovisual speech while either monitoring a tone sequence for a target tone or ignoring the tones. Listeners subsequently categorized the steps of an auditory /p/-/t/ continuum more often in line with their exposure. The aftereffect of phonetic recalibration was reduced, but not eliminated, by attentional load during exposure. In Experiment 2, participants saw an ambiguous visual speech gesture that was disambiguated auditorily as either /p/ or /t/. At test, listeners categorized the steps of a visual /p/-/t/ continuum more often in line with the prior exposure. Imposing load in the auditory modality during exposure did not reduce the aftereffect of this type of cross-modal phonetic recalibration. Together, these results suggest that auditory attentional resources are needed for the processing of auditory speech and/or for the shifting of auditory phonetic category boundaries. Listeners thus need to dedicate attentional resources in order to accommodate talker idiosyncrasies in audiovisual speech.
13.
Abstract
Speech research during recent years has moved progressively away from its traditional focus on audition toward a more multisensory approach. In addition to audition and vision, many somatosenses including proprioception, pressure, vibration and aerotactile sensation are all highly relevant modalities for experiencing and/or conveying speech. In this article, we review both long-standing cross-modal effects stemming from decades of audiovisual speech research as well as new findings related to somatosensory effects. Cross-modal effects in speech perception to date are found to be constrained by temporal congruence and signal relevance, but appear to be unconstrained by spatial congruence. Far from taking place in a one-, two- or even three-dimensional space, the literature reveals that speech occupies a highly multidimensional sensory space. We argue that future research in cross-modal effects should expand to consider each of these modalities both separately and in combination with other modalities in speech.
Affiliation(s)
- Megan Keough
- Interdisciplinary Speech Research Lab, Department of Linguistics, University of British Columbia, Vancouver, British Columbia V6T 1Z4, Canada
- Donald Derrick
- New Zealand Institute of Brain and Behaviour, University of Canterbury, Christchurch 8140, New Zealand
- MARCS Institute for Brain, Behaviour and Development, Western Sydney University, Penrith, New South Wales 2751, Australia
- Bryan Gick
- Interdisciplinary Speech Research Lab, Department of Linguistics, University of British Columbia, Vancouver, British Columbia V6T 1Z4, Canada
- Haskins Laboratories, Yale University, New Haven, CT 06511, USA
14. Morís Fernández L, Torralba M, Soto-Faraco S. Theta oscillations reflect conflict processing in the perception of the McGurk illusion. Eur J Neurosci 2018;48:2630-2641. [DOI: 10.1111/ejn.13804]
Affiliation(s)
- Luis Morís Fernández
- Multisensory Research Group, Center for Brain and Cognition, Dept. de Tecnologies de la Informació i les Comunicacions, Universitat Pompeu Fabra, Roc Boronat 138, 08018 Barcelona, Spain
- Mireia Torralba
- Multisensory Research Group, Center for Brain and Cognition, Dept. de Tecnologies de la Informació i les Comunicacions, Universitat Pompeu Fabra, Roc Boronat 138, 08018 Barcelona, Spain
- Salvador Soto-Faraco
- Multisensory Research Group, Center for Brain and Cognition, Dept. de Tecnologies de la Informació i les Comunicacions, Universitat Pompeu Fabra, Roc Boronat 138, 08018 Barcelona, Spain
- Institució Catalana de Recerca i Estudis Avançats (ICREA), Barcelona, Spain
15.
Abstract
While audiovisual integration is well known in speech perception, faces and speech are also informative with respect to speaker recognition. To date, audiovisual integration in the recognition of familiar people has never been demonstrated. Here we show systematic benefits and costs for the recognition of familiar voices when these are combined with time-synchronized articulating faces, of corresponding or noncorresponding speaker identity, respectively. While these effects were strong for familiar voices, they were smaller or nonsignificant for unfamiliar voices, suggesting that the effects depend on the previous creation of a multimodal representation of a person's identity. Moreover, the effects were reduced or eliminated when voices were combined with the same faces presented as static pictures, demonstrating that the effects do not simply reflect the use of facial identity as a “cue” for voice recognition. This is the first direct evidence for audiovisual integration in person recognition.
16. Jansen SD, Keebler JR, Chaparro A. Shifts in Maximum Audiovisual Integration with Age. Multisens Res 2018;31:191-212. [DOI: 10.1163/22134808-00002599]
Abstract
Listeners attempting to understand speech in noisy environments rely on visual and auditory processes, typically referred to as audiovisual processing. Noise corrupts the auditory speech signal, and listeners naturally leverage visual cues from the talker's face in an attempt to interpret the degraded auditory signal. Studies of speech intelligibility in noise show that the maximum improvement in speech recognition performance (i.e., maximum visual enhancement, or VEmax), derived from seeing an interlocutor's face, is invariant with age. Several studies have reported that VEmax is typically associated with a signal-to-noise ratio (SNR) of -12 dB; however, few studies have systematically investigated whether the SNR associated with VEmax changes with age. We investigated whether VEmax changes as a function of age, whether the SNR at VEmax changes as a function of age, and what perceptual/cognitive abilities account for or mediate such relationships. We measured VEmax in a nongeriatric adult sample ranging in age from 20 to 59 years old. We found that VEmax was age-invariant, replicating earlier studies. No perceptual/cognitive measures predicted VEmax, most likely due to limited variance in VEmax scores. Importantly, we found that the SNR at VEmax shifts toward higher (quieter) SNR levels with increasing age; however, this relationship is partially mediated by working memory capacity, where those with larger working memory capacities (WMCs) can identify speech under lower (louder) SNR levels than their age equivalents with smaller WMCs. The current study is the first to report that individual differences in WMC partially mediate the age-related shift in SNR at VEmax.
Collapse
Affiliation(s)
- Joseph R. Keebler
- Department of Human Factors and Behavioral Neurobiology, Embry-Riddle Aeronautical University, Daytona Beach, FL, USA
- Alex Chaparro
- Department of Human Factors and Behavioral Neurobiology, Embry-Riddle Aeronautical University, Daytona Beach, FL, USA
17. Alsius A, Paré M, Munhall KG. Forty Years After Hearing Lips and Seeing Voices: The McGurk Effect Revisited. Multisens Res 2018;31:111-144. [PMID: 31264597; DOI: 10.1163/22134808-00002565]
Abstract
Since its discovery 40 years ago, the McGurk illusion has usually been cited as a prototypical case of multisensory binding in humans, and it has been extensively used in speech perception studies as a proxy measure for audiovisual integration mechanisms. Despite the well-established practice of using the McGurk illusion as a tool for studying the mechanisms underlying audiovisual speech integration, the magnitude of the illusion varies enormously across studies. Furthermore, the processing of McGurk stimuli differs from congruent audiovisual processing at both the phenomenological and neural levels. This calls into question the suitability of this illusion as a tool to quantify the necessary and sufficient conditions under which audiovisual integration occurs in natural conditions. In this paper, we review some of the practical and theoretical issues related to the use of the McGurk illusion as an experimental paradigm. We believe that, without a richer understanding of the mechanisms involved in the processing of the McGurk effect, experimenters should be cautious when generalizing data generated by McGurk stimuli to matching audiovisual speech events.
Collapse
Affiliation(s)
- Agnès Alsius
- Psychology Department, Queen's University, Humphrey Hall, 62 Arch St., Kingston, Ontario, K7L 3N6 Canada
- Martin Paré
- Psychology Department, Queen's University, Humphrey Hall, 62 Arch St., Kingston, Ontario, K7L 3N6 Canada
- Kevin G Munhall
- Psychology Department, Queen's University, Humphrey Hall, 62 Arch St., Kingston, Ontario, K7L 3N6 Canada
18.
Affiliation(s)
- Benoît G. Bardy
- EuroMov, Université de Montpellier, Institut Universitaire de France
19.
Abstract
There is evidence that for both auditory and visual speech perception, familiarity with the talker facilitates speech recognition. Explanations of these effects have concentrated on the retention of talker information specific to each of these modalities. It could be, however, that some amodal, talker-specific articulatory-style information facilitates speech perception in both modalities. If this is true, then experience with a talker in one modality should facilitate perception of speech from that talker in the other modality. In a test of this prediction, subjects were given about 1 hr of experience lipreading a talker and were then asked to recover speech in noise from either this same talker or a different talker. Results revealed that subjects who lip-read and heard speech from the same talker performed better on the speech-in-noise task than did subjects who lip-read from one talker and then heard speech from a different talker.
20. Rosenblum LD, Dorsi J, Dias JW. The Impact and Status of Carol Fowler's Supramodal Theory of Multisensory Speech Perception. Ecological Psychology 2016. [DOI: 10.1080/10407413.2016.1230373]
21. Hisanaga S, Sekiyama K, Igasaki T, Murayama N. Language/Culture Modulates Brain and Gaze Processes in Audiovisual Speech Perception. Sci Rep 2016;6:35265. [PMID: 27734953; PMCID: PMC5062344; DOI: 10.1038/srep35265]
Abstract
Several behavioural studies have shown that the interplay between voice and face information in audiovisual speech perception is not universal. Native English speakers (ESs) are influenced by visual mouth movement to a greater degree than native Japanese speakers (JSs) when listening to speech. However, the biological basis of these group differences is unknown. Here, we demonstrate the time-varying processes of group differences in terms of event-related brain potentials (ERP) and eye gaze for audiovisual and audio-only speech perception. On a behavioural level, while congruent mouth movement shortened the ESs’ response time for speech perception, the opposite effect was observed in JSs. Eye-tracking data revealed a gaze bias to the mouth for the ESs but not the JSs, especially before the audio onset. Additionally, the ERP P2 amplitude indicated that ESs processed multisensory speech more efficiently than auditory-only speech; however, the JSs exhibited the opposite pattern. Taken together, the ESs’ early visual attention to the mouth was likely to promote phonetic anticipation, which was not the case for the JSs. These results clearly indicate the impact of language and/or culture on multisensory speech processing, suggesting that linguistic/cultural experiences lead to the development of unique neural systems for audiovisual speech perception.
Collapse
Affiliation(s)
- Satoko Hisanaga
- Division of Cognitive Psychology, Faculty of Letters, Kumamoto University, 2-40-1 Kurokami, Chuo-ku, Kumamoto 860-8555, Japan
- Kaoru Sekiyama
- Division of Cognitive Psychology, Faculty of Letters, Kumamoto University, 2-40-1 Kurokami, Chuo-ku, Kumamoto 860-8555, Japan
- Tomohiko Igasaki
- Department of Information Technology on Human and Environmental Science, Graduate School of Science and Technology, Kumamoto University, 2-39-1 Kurokami, Chuo-ku, Kumamoto 860-8555, Japan
- Nobuki Murayama
- Department of Information Technology on Human and Environmental Science, Graduate School of Science and Technology, Kumamoto University, 2-39-1 Kurokami, Chuo-ku, Kumamoto 860-8555, Japan
22. Kaufmann JM, Schweinberger SR. Speaker Variations Influence Speechreading Speed for Dynamic Faces. Perception 2005;34:595-610. [PMID: 15991696; DOI: 10.1068/p5104]
Abstract
We investigated the influence of task-irrelevant speaker variations on speechreading performance. In three experiments with video-digitised faces presented either in dynamic, static-sequential, or static mode, participants performed speeded classifications on vowel utterances (German vowels /u/ and /i/). A Garner interference paradigm was used, in which speaker identity was task-irrelevant but could be either correlated, constant, or orthogonal to the vowel uttered. Reaction times for facial speech classifications were slowed by task-irrelevant speaker variations for dynamic stimuli. The results are discussed with reference to distributed models of face perception (Haxby et al., 2000, Trends in Cognitive Sciences 4, 223-233) and the relevance of both dynamic information and speaker characteristics for speechreading.
Collapse
Affiliation(s)
- Jürgen M Kaufmann
- Department of Psychology, University of Glasgow, 58 Hillhead Street, Glasgow G12 8QB, Scotland, UK.
23. McCotter MV, Jordan TR. The Role of Facial Colour and Luminance in Visual and Audiovisual Speech Perception. Perception 2003;32:921-936. [PMID: 14580139; DOI: 10.1068/p3316]
Abstract
We conducted four experiments to investigate the role of colour and luminance information in visual and audiovisual speech perception. In experiments 1a (stimuli presented in quiet conditions) and 1b (stimuli presented in auditory noise), face display types comprised naturalistic colour (NC), grey-scale (GS), and luminance inverted (LI) faces. In experiments 2a (quiet) and 2b (noise), face display types comprised NC, colour inverted (CI), LI, and colour and luminance inverted (CLI) faces. Six syllables and twenty-two words were used to produce auditory and visual speech stimuli. Auditory and visual signals were combined to produce congruent and incongruent audiovisual speech stimuli. Experiments 1a and 1b showed that perception of visual speech, and its influence on identifying the auditory components of congruent and incongruent audiovisual speech, was less for LI than for either NC or GS faces, which produced identical results. Experiments 2a and 2b showed that perception of visual speech, and influences on perception of incongruent auditory speech, was less for LI and CLI faces than for NC and CI faces (which produced identical patterns of performance). Our findings for NC and CI faces suggest that colour is not critical for perception of visual and audiovisual speech. The effect of luminance inversion on performance accuracy was relatively small (5%), which suggests that the luminance information preserved in LI faces is important for the processing of visual and audiovisual speech.
Collapse
Affiliation(s)
- Maxine V McCotter
- School of Psychology, University of Nottingham, University Park, Nottingham NG7 2RD, UK.
24. Roark DA, Barrett SE, Spence MJ, Abdi H, O'Toole AJ. Psychological and Neural Perspectives on the Role of Motion in Face Recognition. Behav Cogn Neurosci Rev 2003;2:15-46. [PMID: 17715597; DOI: 10.1177/1534582303002001002]
Abstract
In the real world, faces are in constant motion. Recently, researchers have begun to consider how facial motion affects memory for faces. The authors offer a theoretical framework that synthesizes psychological findings on memory for moving faces. Three hypotheses about the possible roles of facial motion in memory are evaluated. In general, although facial motion is helpful for recognizing familiar/famous faces, its benefits are less certain with unfamiliar faces. Importantly, the implicit social signals provided by a moving face (e.g., gaze changes, expression, and facial speech) may mediate the effects of facial motion on recognition. Insights from the developmental literature, which highlight the significance of attention in the processing of social information from faces, are also discussed. Finally, a neural systems framework that considers both the processing of socially relevant motion information and static feature-based information is presented. This neural systems model provides a useful framework for understanding the divergent psychological findings.
25. Rosenblum LD, Dias JW, Dorsi J. The supramodal brain: implications for auditory perception. Journal of Cognitive Psychology 2016. [DOI: 10.1080/20445911.2016.1181691]
26. Pratt H, Bleich N, Mittelman N. Spatio-temporal distribution of brain activity associated with audio-visually congruent and incongruent speech and the McGurk Effect. Brain Behav 2015;5:e00407. [PMID: 26664791; PMCID: PMC4667754; DOI: 10.1002/brb3.407]
Abstract
INTRODUCTION: Spatio-temporal distributions of cortical activity to audio-visual presentations of meaningless vowel-consonant-vowels, and the effects of audio-visual congruence/incongruence, were studied, with emphasis on the McGurk effect. The McGurk effect occurs when a clearly audible syllable with one consonant is presented simultaneously with a visual presentation of a face articulating a syllable with a different consonant, and the resulting percept is a syllable with a consonant other than the auditorily presented one. METHODS: Twenty subjects listened to pairs of audio-visually congruent or incongruent utterances and indicated whether pair members were the same or not. Source current densities of event-related potentials to the first utterance in the pair were estimated, and the effects of stimulus-response combinations, brain area, hemisphere, and clarity of visual articulation were assessed. RESULTS: Auditory cortex, superior parietal cortex, and middle temporal cortex were the most consistently involved areas across experimental conditions. Early (<200 msec) processing of the consonant was overall prominent in the left hemisphere, except for right hemisphere prominence in superior parietal cortex and secondary visual cortex. Clarity of visual articulation impacted activity in secondary visual cortex and Wernicke's area. McGurk perception was associated with decreased activity in primary and secondary auditory cortices and Wernicke's area before 100 msec, followed by increased activity around 100 msec, which decreased again around 180 msec. Activity in Broca's area was unaffected by McGurk perception and was only increased to congruent audio-visual stimuli 30-70 msec following consonant onset. CONCLUSIONS: The results suggest left hemisphere prominence in the effects of stimulus and response conditions on eight brain areas involved in dynamically distributed parallel processing of audio-visual integration. Initially (30-70 msec), subcortical contributions to auditory cortex, superior parietal cortex, and middle temporal cortex occur. During 100-140 msec, peristriate visual influences and Wernicke's area join in the processing. Resolution of incongruent audio-visual inputs is then attempted, and if successful, McGurk perception occurs and cortical activity in the left hemisphere further increases between 170 and 260 msec.
Collapse
Affiliation(s)
- Hillel Pratt
- Evoked Potentials Laboratory, Technion - Israel Institute of Technology, Haifa 32000, Israel
- Naomi Bleich
- Evoked Potentials Laboratory, Technion - Israel Institute of Technology, Haifa 32000, Israel
- Nomi Mittelman
- Evoked Potentials Laboratory, Technion - Israel Institute of Technology, Haifa 32000, Israel
27. Jaekl P, Pesquita A, Alsius A, Munhall K, Soto-Faraco S. The contribution of dynamic visual cues to audiovisual speech perception. Neuropsychologia 2015;75:402-410. [PMID: 26100561; DOI: 10.1016/j.neuropsychologia.2015.06.025]
Abstract
Seeing a speaker's facial gestures can significantly improve speech comprehension, especially in noisy environments. However, the nature of the visual information from the speaker's facial movements that is relevant for this enhancement is still unclear. Like auditory speech signals, visual speech signals unfold over time and contain both dynamic configural information and luminance-defined local motion cues; two information sources that are thought to engage anatomically and functionally separate visual systems. Whereas some past studies have highlighted the importance of local, luminance-defined motion cues in audiovisual speech perception, the contribution of dynamic configural information signalling changes in form over time has not yet been assessed. We therefore attempted to single out the contribution of dynamic configural information to audiovisual speech processing. To this aim, we measured word identification performance in noise using unimodal auditory stimuli and audiovisual stimuli. In the audiovisual condition, speaking faces were presented as point-light displays achieved via motion capture of the original talker. Point-light displays could be isoluminant, to minimise the contribution of effective luminance-defined local motion information, or with added luminance contrast, allowing the combined effect of dynamic configural cues and local motion cues. Audiovisual enhancement was found in both the isoluminant and contrast-based luminance conditions compared to an auditory-only condition, demonstrating, for the first time, the specific contribution of dynamic configural cues to audiovisual speech improvement. These findings imply that globally processed changes in a speaker's facial shape contribute significantly to the perception of articulatory gestures and the analysis of audiovisual speech.
Collapse
Affiliation(s)
- Philip Jaekl
- Center for Visual Science and Department of Brain and Cognitive Sciences, University of Rochester, Rochester, NY, USA.
- Ana Pesquita
- UBC Vision Lab, Department of Psychology, University of British Columbia, Vancouver, BC, Canada
- Agnes Alsius
- Department of Psychology, Queen's University, Kingston, ON, Canada
- Kevin Munhall
- Department of Psychology, Queen's University, Kingston, ON, Canada
- Salvador Soto-Faraco
- Centre for Brain and Cognition, Department of Information Technology and Communications, Universitat Pompeu Fabra, Barcelona, Spain
- Institució Catalana de Recerca i Estudis Avançats (ICREA), Barcelona, Spain
28. Maddox RK, Atilgan H, Bizley JK, Lee AKC. Auditory selective attention is enhanced by a task-irrelevant temporally coherent visual stimulus in human listeners. eLife 2015;4:e04995. [PMID: 25654748; PMCID: PMC4337603; DOI: 10.7554/elife.04995]
Abstract
In noisy settings, listening is aided by correlated dynamic visual cues gleaned from a talker's face, an improvement often attributed to visually reinforced linguistic information. In this study, we aimed to test the effect of audio-visual temporal coherence alone on selective listening, free of linguistic confounds. We presented listeners with competing auditory streams whose amplitude varied independently and a visual stimulus with varying radius, while manipulating the cross-modal temporal relationships. Performance improved when the auditory target's timecourse matched that of the visual stimulus. The fact that the coherence was between task-irrelevant stimulus features suggests that the observed improvement stemmed from the integration of auditory and visual streams into cross-modal objects, enabling listeners to better attend the target. These findings suggest that in everyday conditions, where listeners can often see the source of a sound, temporal cues provided by vision can help listeners to select one sound source from a mixture.
Collapse
Affiliation(s)
- Ross K Maddox
- Institute for Learning and Brain Sciences, University of Washington, Seattle, United States
- Huriye Atilgan
- Ear Institute, University College London, London, United Kingdom
- Adrian KC Lee
- Institute for Learning and Brain Sciences, University of Washington, Seattle, United States
- Department of Speech and Hearing Sciences, University of Washington, Seattle, United States
29. Bernstein LE, Liebenthal E. Neural pathways for visual speech perception. Front Neurosci 2014;8:386. [PMID: 25520611; PMCID: PMC4248808; DOI: 10.3389/fnins.2014.00386]
Abstract
This paper examines two questions: what levels of speech can be perceived visually, and how is visual speech represented by the brain? A review of the literature leads to the conclusions that every level of psycholinguistic speech structure (i.e., phonetic features, phonemes, syllables, words, and prosody) can be perceived visually, although individuals differ in their abilities to do so, and that there are visual modality-specific representations of speech qua speech in higher-level vision brain areas. That is, the visual system represents the modal patterns of visual speech. The suggestion that the auditory speech pathway receives and represents visual speech is examined in light of neuroimaging evidence on the auditory speech pathways. We outline the generally agreed-upon organization of the visual ventral and dorsal pathways and examine several types of visual processing that might be related to speech through those pathways, specifically face and body, orthography, and sign language processing. In this context, we examine the visual speech processing literature, which reveals widespread, diverse patterns of activity in posterior temporal cortices in response to visual speech stimuli. We outline a model of the visual and auditory speech pathways and make several suggestions: (1) the visual perception of speech relies on visual pathway representations of speech qua speech; (2) a proposed site of these representations, the temporal visual speech area (TVSA), has been demonstrated in posterior temporal cortex, ventral and posterior to the multisensory posterior superior temporal sulcus (pSTS); and (3) given that visual speech has dynamic and configural features, its representations in feedforward visual pathways are expected to integrate these features, possibly in the TVSA.
Affiliation(s)
- Lynne E Bernstein: Department of Speech and Hearing Sciences, George Washington University, Washington, DC, USA
- Einat Liebenthal: Department of Neurology, Medical College of Wisconsin, Milwaukee, WI, USA; Department of Psychiatry, Brigham and Women's Hospital, Boston, MA, USA
|
30
|
Strand J, Cooperman A, Rowe J, Simenstad A. Individual differences in susceptibility to the McGurk effect: links with lipreading and detecting audiovisual incongruity. J Speech Lang Hear Res 2014; 57:2322-2331. [PMID: 25296272 DOI: 10.1044/2014_jslhr-h-14-0059] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 02/18/2014] [Accepted: 09/19/2014] [Indexed: 06/03/2023]
Abstract
PURPOSE Prior studies (e.g., Nath & Beauchamp, 2012) report large individual variability in the extent to which participants are susceptible to the McGurk effect, a prominent audiovisual (AV) speech illusion. The current study evaluated whether susceptibility to the McGurk effect (MGS) is related to lipreading skill and whether multiple measures of MGS that have been used previously are correlated. In addition, it evaluated the test-retest reliability of individual differences in MGS. METHOD Seventy-three college-age participants completed 2 tasks measuring MGS and 3 measures of lipreading skill. Fifty-eight participants returned for a 2nd session (approximately 2 months later) in which MGS was tested again. RESULTS The current study demonstrated that MGS shows high test-retest reliability and is correlated with some measures of lipreading skill. In addition, susceptibility measures derived from identification tasks were moderately related to the ability to detect instances of AV incongruity. CONCLUSIONS Although MGS is often cited as a demonstration of AV integration, the results suggest that perceiving the illusion depends in part on individual differences in lipreading skill and detecting AV incongruity. Therefore, individual differences in susceptibility to the illusion are not solely attributable to individual differences in AV integration ability.
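Test-retest reliability of McGurk susceptibility reduces to correlating per-participant scores across the two sessions. A sketch with fabricated data, using scipy's pearsonr; the effect sizes are illustrative only.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(1)

# Hypothetical proportion of McGurk (fused) responses per participant,
# for the 58 participants who returned for the second session.
session1 = rng.uniform(0.0, 1.0, size=58)
session2 = np.clip(session1 + rng.normal(0, 0.15, size=58), 0.0, 1.0)  # trait + noise

r, p = pearsonr(session1, session2)
print(f"test-retest r = {r:.2f}, p = {p:.3g}")
```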
|
31
|
Alsius A, Möttönen R, Sams ME, Soto-Faraco S, Tiippana K. Effect of attentional load on audiovisual speech perception: evidence from ERPs. Front Psychol 2014; 5:727. [PMID: 25076922 PMCID: PMC4097954 DOI: 10.3389/fpsyg.2014.00727] [Citation(s) in RCA: 49] [Impact Index Per Article: 4.9] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/01/2014] [Accepted: 06/23/2014] [Indexed: 11/13/2022] Open
Abstract
Seeing articulatory movements influences perception of auditory speech. This is often reflected in a shortened latency of auditory event-related potentials (ERPs) generated in the auditory cortex. The present study addressed whether this early neural correlate of audiovisual interaction is modulated by attention. We recorded ERPs in 15 subjects while they were presented with auditory, visual, and audiovisual spoken syllables. Audiovisual stimuli consisted of incongruent auditory and visual components known to elicit a McGurk effect, i.e., a visually driven alteration in the auditory speech percept. In a Dual task condition, participants were asked to identify spoken syllables whilst monitoring a rapid visual stream of pictures for targets, i.e., they had to divide their attention. In a Single task condition, participants identified the syllables without any other tasks, i.e., they were asked to ignore the pictures and focus their attention fully on the spoken syllables. The McGurk effect was weaker in the Dual task than in the Single task condition, indicating an effect of attentional load on audiovisual speech perception. Early auditory ERP components, N1 and P2, peaked earlier to audiovisual stimuli than to auditory stimuli when attention was fully focused on syllables, indicating neurophysiological audiovisual interaction. This latency decrement was reduced when attention was loaded, suggesting that attention influences early neural processing of audiovisual speech. We conclude that reduced attention weakens the interaction between vision and audition in speech.
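The latency decrement reported here is measured by locating the N1 peak in each condition's averaged waveform. A toy sketch on synthetic ERPs; a real analysis would use epoched EEG (e.g., via MNE-Python), and the latencies, search window, and noise level below are assumptions.

```python
import numpy as np

fs = 1000                                    # 1 kHz sampling rate
t = np.arange(-0.1, 0.4, 1 / fs)             # epoch from -100 to 400 ms
rng = np.random.default_rng(2)

def toy_erp(n1_latency):
    """Toy averaged ERP: a negative N1 deflection at n1_latency seconds."""
    wave = -2.0 * np.exp(-((t - n1_latency) ** 2) / (2 * 0.01 ** 2))
    return wave + rng.normal(0, 0.05, t.size)

erp_a  = toy_erp(0.110)                      # auditory-only N1 near 110 ms
erp_av = toy_erp(0.098)                      # audiovisual N1 shifted earlier

win = (t >= 0.07) & (t <= 0.15)              # N1 search window, 70-150 ms
latency = lambda w: t[win][np.argmin(w[win])]
print(f"N1 latency A: {latency(erp_a)*1e3:.0f} ms, AV: {latency(erp_av)*1e3:.0f} ms")
```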
Affiliation(s)
- Agnès Alsius: Psychology Department, Queen's University, Kingston, ON, Canada
- Riikka Möttönen: Department of Experimental Psychology, University of Oxford, Oxford, UK
- Mikko E Sams: Brain and Mind Laboratory, School of Science, Aalto University, Espoo, Finland
- Salvador Soto-Faraco: Institut Català de Recerca i Estudis Avançats, Barcelona, Spain; Brain and Cognition Center, Universitat Pompeu Fabra, Barcelona, Spain
- Kaisa Tiippana: Institute of Behavioural Sciences, University of Helsinki, Helsinki, Finland
|
32
|
Affiliation(s)
- Kaisa Tiippana: Division of Cognitive Psychology and Neuropsychology, Institute of Behavioural Sciences, University of Helsinki, Helsinki, Finland
|
33
|
Mitchel AD, Weiss DJ. Visual speech segmentation: using facial cues to locate word boundaries in continuous speech. Lang Cogn Process 2014; 29:771-780. [PMID: 25018577 PMCID: PMC4091796 DOI: 10.1080/01690965.2013.791703] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/29/2023]
Abstract
Speech is typically a multimodal phenomenon, yet few studies have focused on the exclusive contributions of visual cues to language acquisition. To address this gap, we investigated whether visual prosodic information can facilitate speech segmentation. Previous research has demonstrated that language learners can use lexical stress and pitch cues to segment speech and that learners can extract this information from talking faces. Thus, we created an artificial speech stream that contained minimal segmentation cues and paired it with two synchronous facial displays in which visual prosody was either informative or uninformative for identifying word boundaries. Across three familiarisation conditions (audio stream alone, facial streams alone, and paired audiovisual), learning occurred only when the facial displays were informative to word boundaries, suggesting that facial cues can help learners solve the early challenges of language acquisition.
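Artificial speech streams of this kind are built so that syllable-to-syllable transitional probability (TP) is high within words and drops at word boundaries, the statistical dip that an informative facial display can mark. A minimal sketch with a hypothetical four-word inventory:

```python
import random

random.seed(3)
# Hypothetical trisyllabic words: within-word TP = 1.0.
words = ["pabiku", "tibudo", "golatu", "daropi"]
words_syl = [[w[i:i + 2] for i in range(0, 6, 2)] for w in words]

stream = []
prev = None
for _ in range(300):
    w = random.choice([x for x in words_syl if x is not prev])  # no immediate repeats
    stream.extend(w)
    prev = w

# At each word boundary, any of the other 3 words may follow, so the
# boundary TP is about 1/3 -- the dip a learner can exploit.
print("".join(stream[:18]))
```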
Affiliation(s)
- Aaron D. Mitchel: Department of Psychology, Bucknell University, Lewisburg, PA 17837, USA
- Daniel J. Weiss: Department of Psychology and Program in Linguistics, The Pennsylvania State University, 643 Moore Building, University Park, PA 16802, USA
|
34
|
Viviani P, Figliozzi F, Lacquaniti F. The perception of visible speech: estimation of speech rate and detection of time reversals. Exp Brain Res 2011; 215:141-61. [PMID: 21986668 DOI: 10.1007/s00221-011-2883-9] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/27/2011] [Accepted: 09/16/2011] [Indexed: 11/29/2022]
Abstract
Four experiments investigated the perception of visible speech. Experiment 1 addressed the perception of speech rate. Observers were shown video clips of the lower face of actors speaking at their spontaneous rate. Then, they were shown muted versions of the video clips, which were either accelerated or decelerated. The task (scaling) was to compare visually the speech rate of the stimulus to the spontaneous rate of the actor being shown. Rate estimates were accurate when the video clips were shown in the normal direction (forward mode). In contrast, speech rate was underestimated when the video clips were shown in reverse (backward mode). Experiments 2-4 (2AFC) investigated how accurately one discriminates forward from backward speech movements. Unlike in Experiment 1, observers were never exposed to the sound track of the video clips. Performance was well above chance when playback mode was crossed with rate modulation, and the number of repetitions of the stimuli allowed some amount of speechreading to take place in forward mode (Experiment 2). In Experiment 3, speechreading was made much more difficult by using a different and larger set of muted video clips. Yet, accuracy decreased only slightly with respect to Experiment 2. Thus, kinematic rather than speechreading cues are most important for discriminating movement direction. Performance worsened, but remained above chance level, when the same stimuli of Experiment 3 were rotated upside down (Experiment 4). We argue that the results are in keeping with the hypothesis that visual perception taps into implicit motor competence. Thus, lawful instances of biological movements (forward stimuli) are processed differently from backward stimuli representing movements that the observer cannot perform.
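Accelerating or decelerating a muted clip, as in Experiment 1, amounts to resampling its frame indices. A minimal sketch using nearest-frame selection (real stimuli would need proper video I/O and smoother interpolation; the rates below are arbitrary):

```python
import numpy as np

def change_rate(frames: np.ndarray, rate: float) -> np.ndarray:
    """Return the clip played at `rate` x its original speed.

    rate > 1 accelerates (fewer output frames), rate < 1 decelerates.
    Nearest-neighbour frame selection; reverse playback is frames[::-1].
    """
    n_out = int(round(len(frames) / rate))
    idx = np.clip(np.round(np.linspace(0, len(frames) - 1, n_out)).astype(int),
                  0, len(frames) - 1)
    return frames[idx]

clip = np.zeros((120, 64, 64))        # 120 dummy frames of a lower-face video
faster = change_rate(clip, 1.25)      # 25% accelerated
slower = change_rate(clip, 0.80)      # 20% decelerated
print(len(faster), len(slower))       # 96, 150
```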
Affiliation(s)
- Paolo Viviani: Laboratory of Neuromotor Physiology, Santa Lucia Foundation, via Ardeatina 306, 00179 Rome, Italy
|
35
|
Buchan JN, Munhall KG. The Influence of Selective Attention to Auditory and Visual Speech on the Integration of Audiovisual Speech Information. Perception 2011; 40:1164-82. [DOI: 10.1068/p6939] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/15/2022]
Abstract
Conflicting visual speech information can influence the perception of acoustic speech, causing an illusory percept of a sound not present in the actual acoustic speech (the McGurk effect). We examined whether participants can voluntarily selectively attend to either the auditory or visual modality by instructing participants to pay attention to the information in one modality and to ignore competing information from the other modality. We also examined how performance under these instructions was affected by weakening the influence of the visual information by manipulating the temporal offset between the audio and video channels (experiment 1), and the spatial frequency information present in the video (experiment 2). Gaze behaviour was also monitored to examine whether attentional instructions influenced the gathering of visual information. While task instructions did have an influence on the observed integration of auditory and visual speech information, participants were unable to completely ignore conflicting information, particularly information from the visual stream. Manipulating temporal offset had a more pronounced interaction with task instructions than manipulating the amount of visual information. Participants' gaze behaviour suggests that the attended modality influences the gathering of visual information in audiovisual speech perception.
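Reducing the spatial frequency information in the video, as in experiment 2, is commonly implemented as a low-pass filter applied to each frame. A sketch with a Gaussian blur via scipy.ndimage; the frame size and cutoff are assumptions, not the authors' parameters.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(4)
frame = rng.uniform(0, 1, size=(240, 320))   # one dummy video frame

# Larger sigma removes more high-spatial-frequency detail from the face.
lowpassed = gaussian_filter(frame, sigma=8.0)

# The temporal-offset manipulation (experiment 1) is simply a shift of the
# audio track relative to the frames, e.g. audio delayed by 200 ms.
```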
Affiliation(s)
- Kevin G Munhall: Department of Otolaryngology, Queen's University, Kingston, Ontario, Canada
|
36
|
Bernstein LE, Jiang J, Pantazis D, Lu ZL, Joshi A. Visual phonetic processing localized using speech and nonspeech face gestures in video and point-light displays. Hum Brain Mapp 2010; 32:1660-76. [PMID: 20853377 DOI: 10.1002/hbm.21139] [Citation(s) in RCA: 41] [Impact Index Per Article: 2.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/23/2008] [Revised: 06/08/2010] [Accepted: 06/28/2010] [Indexed: 11/09/2022] Open
Abstract
The talking face affords multiple types of information. To isolate cortical sites with responsibility for integrating linguistically relevant visual speech cues, speech and nonspeech face gestures were presented in natural video and point-light displays during fMRI scanning at 3.0T. Participants with normal hearing viewed the stimuli and also viewed localizers for the fusiform face area (FFA), the lateral occipital complex (LOC), and the visual motion (V5/MT) regions of interest (ROIs). The FFA, the LOC, and V5/MT were significantly less activated for speech relative to nonspeech and control stimuli. Distinct activation of the posterior superior temporal sulcus and the adjacent middle temporal gyrus to speech, independent of media, was obtained in group analyses. Individual analyses showed that speech and nonspeech stimuli were associated with adjacent but different activations, with the speech activations more anterior. We suggest that the speech activation area is the temporal visual speech area (TVSA), and that it can be localized with the combination of stimuli used in this study.
Affiliation(s)
- Lynne E Bernstein: Division of Communication and Auditory Neuroscience, House Ear Institute, Los Angeles, California, USA
|
37
|
Hietanen JK, Manninen P, Sams M, Surakka V. Does audiovisual speech perception use information about facial configuration? Eur J Cogn Psychol 2010. [DOI: 10.1080/09541440126006] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/27/2022]
|
38
|
Pilling M. Auditory event-related potentials (ERPs) in audiovisual speech perception. J Speech Lang Hear Res 2009; 52:1073-1081. [PMID: 19641083 DOI: 10.1044/1092-4388(2009/07-0276)] [Citation(s) in RCA: 69] [Impact Index Per Article: 4.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/28/2023]
Abstract
PURPOSE It has recently been reported (e.g., V. van Wassenhove, K. W. Grant, & D. Poeppel, 2005) that audiovisual (AV) presented speech is associated with an N1/P2 auditory event-related potential (ERP) response that is lower in peak amplitude compared with the responses associated with auditory only (AO) speech. This effect was replicated. Further comparisons were made between ERP responses to AV speech in which the visual and auditory components were in or out of synchrony, to test whether the effect is associated with the operation of integration mechanisms, as has been claimed, or occurs because of other factors such as attention. METHOD ERPs were recorded from participants presented with recordings of unimodal or AV speech syllables in a detection task. RESULTS Comparisons were made between AO and AV speech and between synchronous and asynchronous AV speech. Synchronous AV speech produced an N1/P2 with lower peak amplitudes than with AO speech, unaccounted for by linear superposition of visually evoked responses onto auditory-evoked responses. Asynchronous AV speech produced no amplitude reduction. CONCLUSION The dependence of N1/P2 amplitude reduction on AV synchrony validates it as an electrophysiological marker of AV integration.
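The superposition check mentioned in the results is the standard additive test: if the AV response were the mere sum of unimodal responses, AV - (A + V) would be near zero everywhere. A sketch on synthetic averaged waveforms (all amplitudes and latencies made up):

```python
import numpy as np

fs = 1000
t = np.arange(-0.1, 0.4, 1 / fs)
bump = lambda lat, amp: amp * np.exp(-((t - lat) ** 2) / (2 * 0.012 ** 2))

erp_a  = bump(0.10, -2.0) + bump(0.19, 2.5)   # auditory N1/P2
erp_v  = bump(0.15, -0.4)                     # small visual-evoked response
erp_av = 0.8 * erp_a + erp_v                  # AV with a suppressed auditory part

# Under pure superposition, the residual would be ~0 at every sample.
residual = erp_av - (erp_a + erp_v)
print(f"peak residual: {np.abs(residual).max():.2f} (nonzero -> interaction)")
```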
Affiliation(s)
- Michael Pilling: MRC Institute of Hearing Research, Nottingham, United Kingdom
|
39
|
Abstract
Humans, like other animals, are exposed to a continuous stream of signals, which are dynamic, multimodal, extended, and time varying in nature. This complex input space must be transduced and sampled by our sensory systems and transmitted to the brain, where it can guide the selection of appropriate actions. To simplify this process, it has been suggested that the brain exploits statistical regularities in the stimulus space. Tests of this idea have largely been confined to unimodal signals and natural scenes. One important class of multisensory signals for which a quantitative input-space characterization is unavailable is human speech. We do not understand what signals our brain has to actively piece together from an audiovisual speech stream to arrive at a percept, versus what is already embedded in the signal structure of the stream itself. In essence, we do not have a clear understanding of the natural statistics of audiovisual speech. In the present study, we identified the following major statistical features of audiovisual speech. First, we observed robust correlations and close temporal correspondence between the area of the mouth opening and the acoustic envelope. Second, we found the strongest correlation between the area of the mouth opening and vocal tract resonances. Third, we observed that both the area of the mouth opening and the voice envelope are temporally modulated in the 2-7 Hz frequency range. Finally, we show that the timing of mouth movements relative to the onset of the voice is consistently between 100 and 300 ms. We interpret these data in the context of recent neural theories of speech, which suggest that speech communication is a reciprocally coupled, multisensory event, whereby the outputs of the signaler are matched to the neural processes of the receiver.

When we watch someone speak, how much work is our brain actually doing? How much of this work is facilitated by the structure of speech itself? Our work shows that not only are the visual and auditory components of speech tightly locked (obviating the need for the brain to actively bind such information), this temporal coordination also has a distinct rhythm of between 2 and 7 Hz. Furthermore, during speech production, the onset of the voice occurs with a delay of between 100 and 300 ms relative to the initial, visible movements of the mouth. These temporal parameters of audiovisual speech are intriguing because they match known properties of neuronal oscillations in the auditory cortex. Thus, given what we already know about the neural processing of speech, the natural features of audiovisual speech signals seem to be optimally structured for their interactions with ongoing brain rhythms in receivers.
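The first and third findings reduce to a correlation between two timecourses and the location of their modulation spectrum's peak. A sketch on synthetic signals, using scipy's welch for the spectrum; the sampling rate, 4 Hz rhythm, and 150 ms lag are illustrative stand-ins for tracked mouth-area and envelope data.

```python
import numpy as np
from scipy.signal import welch

fs = 100                                   # 100 Hz tracking of both signals
t = np.arange(0, 20, 1 / fs)
rng = np.random.default_rng(5)

syllable_rhythm = np.sin(2 * np.pi * 4 * t)        # ~4 Hz, inside the 2-7 Hz band
mouth_area = syllable_rhythm + 0.3 * rng.standard_normal(t.size)
envelope = (np.roll(syllable_rhythm, int(0.15 * fs))    # voice lags mouth ~150 ms
            + 0.3 * rng.standard_normal(t.size))

r = np.corrcoef(mouth_area, envelope)[0, 1]
f, pxx = welch(mouth_area, fs=fs, nperseg=512)
peak = f[np.argmax(pxx)]
print(f"area-envelope r = {r:.2f}; dominant modulation ~ {peak:.1f} Hz")
```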
|
40
|
Cunillera T, Camara E, Laine M, Rodriguez-Fornells A. Speech segmentation is facilitated by visual cues. Q J Exp Psychol (Hove) 2009; 63:260-74. [PMID: 19526435 DOI: 10.1080/17470210902888809] [Citation(s) in RCA: 38] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/20/2022]
Abstract
Evidence from infant studies indicates that language learning can be facilitated by multimodal cues. We extended this observation to adult language learning by studying the effects of simultaneous visual cues (nonassociated object images) on speech segmentation performance. Our results indicate that segmentation of new words from a continuous speech stream is facilitated by simultaneous visual input that is presented at or near syllables exhibiting the low transitional probability indicative of word boundaries. This indicates that temporal audio-visual contiguity helps direct attention to word boundaries at the earliest stages of language learning. Off-boundary or arrhythmic picture sequences did not affect segmentation performance, suggesting that the language learning system can effectively disregard noninformative visual information. Detection of temporal contiguity between multimodal stimuli may be useful to both infants and second-language learners, not only for facilitating speech segmentation but also for detecting word-object relationships in natural environments.
Affiliation(s)
- Toni Cunillera: Department of Basic Psychology, University of Barcelona, Barcelona, Spain
|
41
|
Campbell R. The processing of audio-visual speech: empirical and neural bases. Philos Trans R Soc Lond B Biol Sci 2008; 363:1001-10. [PMID: 17827105 PMCID: PMC2606792 DOI: 10.1098/rstb.2007.2155] [Citation(s) in RCA: 162] [Impact Index Per Article: 10.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
In this selective review, I outline a number of ways in which seeing the talker affects auditory perception of speech, including, but not confined to, the McGurk effect. To date, studies suggest that all linguistic levels are susceptible to visual influence and that two main modes of processing can be described: a complementary mode, whereby vision provides information more efficiently than hearing for some under-specified parts of the speech stream, and a correlated mode, whereby vision partially duplicates information about dynamic articulatory patterning. Cortical correlates of seen speech suggest that at the neurological as well as the perceptual level, auditory processing of speech is affected by vision, so that 'auditory speech regions' are activated by seen speech. The processing of natural speech, whether heard, seen, or heard and seen, activates the perisylvian language regions (left>right). It is highly probable that activation occurs in a specific order: first superior temporal, then inferior parietal, and finally inferior frontal regions (left>right). There is some differentiation of the visual input stream to the core perisylvian language system, suggesting that complementary seen-speech information makes special use of the visual ventral processing stream, while for correlated visual speech, the dorsal processing stream, which is sensitive to visual movement, may be relatively more involved.
Affiliation(s)
- Ruth Campbell: Department of Human Communication Science, University College London, Chandler House, 2 Wakefield Street, London WC1N 1PF, UK
|
42
|
Vatakis A, Spence C. Investigating the effects of inversion on configural processing with an audiovisual temporal-order judgment task. Perception 2008; 37:143-60. [PMID: 18399253 DOI: 10.1068/p5648] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/22/2022]
Abstract
Research has shown that inversion is more detrimental to the perception of faces than to the perception of other types of visual stimuli. Inverting a face results in an impairment of configural information processing that leads to slowed early face processing and reduced accuracy when performance is tested in face recognition tasks. We investigated the effects of inverting speech and non-speech stimuli on audiovisual temporal perception. Upright and inverted audiovisual video clips of a person uttering syllables (experiments 1 and 2), playing musical notes on a piano (experiment 3), or a rhesus monkey producing vocalisations (experiment 4) were presented. Participants made unspeeded temporal-order judgments regarding which modality stream (auditory or visual) appeared to have been presented first. Inverting the visual stream did not have any effect on the sensitivity of temporal discrimination responses in any of the four experiments, implying that audiovisual temporal integration is resilient to the effects of orientation in the picture plane. By contrast, the point of subjective simultaneity differed significantly as a function of orientation only for the audiovisual speech stimuli, not for the non-speech stimuli or monkey calls. That is, smaller auditory leads were required for the inverted than for the upright visual speech stimuli. These results are consistent with the longer processing latencies reported previously when human faces are inverted and demonstrate that the temporal perception of dynamic audiovisual speech can be modulated by changes in the physical properties of the visual speech (i.e., by changes in orientation).
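The point of subjective simultaneity (PSS) in a temporal-order judgment task is typically estimated by fitting a psychometric function to the proportion of "vision first" responses across SOAs. A sketch fitting a cumulative Gaussian with scipy; the response proportions are invented.

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import norm

# SOA in ms (negative: audio led) and hypothetical "vision first" proportions.
soa = np.array([-300., -200., -100., -50., 0., 50., 100., 200., 300.])
p_vis_first = np.array([0.05, 0.10, 0.25, 0.40, 0.55, 0.70, 0.85, 0.95, 0.98])

cum_gauss = lambda x, pss, sigma: norm.cdf(x, loc=pss, scale=sigma)
(pss, sigma), _ = curve_fit(cum_gauss, soa, p_vis_first, p0=(0.0, 100.0))

# PSS: the SOA at which the two streams appear simultaneous; JND ~ 0.675 * sigma.
print(f"PSS = {pss:.0f} ms, JND ~= {0.675 * sigma:.0f} ms")
```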
Affiliation(s)
- Argiro Vatakis: Crossmodal Research Laboratory, Department of Experimental Psychology, University of Oxford, 9 South Parks Road, Oxford OX1 3UD, UK
|
43
|
Norrix LW, Plante E, Vance R, Boliek CA. Auditory-visual integration for speech by children with and without specific language impairment. J Speech Lang Hear Res 2007; 50:1639-1651. [PMID: 18055778 DOI: 10.1044/1092-4388(2007/111)] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/25/2023]
Abstract
PURPOSE It has long been known that children with specific language impairment (SLI) can demonstrate difficulty with auditory speech perception. However, speech perception can also involve the integration of both auditory and visual articulatory information. METHOD Fifty-six preschool children, half with and half without SLI, were studied in order to examine auditory-visual integration. Children watched and listened to video clips of a woman speaking [bi] and [gi]. They also listened to audio clips of [bi], [di], and [gi], produced by the same woman. The effect of visual input on speech perception was tested by presenting an auditory [bi] combined with a visually articulated [gi], which tends to alter the phoneme percept (the McGurk effect). RESULTS Both groups of children performed at ceiling when asked to identify speech tokens in auditory-only and congruent auditory-visual modalities. In the incongruent auditory-visual condition, a stronger McGurk effect was found for the normal language group compared with the children with SLI. CONCLUSION Responses by the children with SLI indicated less impact of visual processing on speech perception than was seen with their normal peers. These results demonstrate that the difficulties with speech perception by SLI children extend beyond the auditory-only modality to include auditory-visual processing as well.
Affiliation(s)
- Linda W Norrix: Department of Speech, Language, and Hearing Sciences, The University of Arizona, P.O. Box 210071, 1131 East 2nd Street, Tucson, AZ 85721-0071, USA
|
44
|
Tremblay C, Champoux F, Voss P, Bacon BA, Lepore F, Théoret H. Speech and non-speech audio-visual illusions: a developmental study. PLoS One 2007; 2:e742. [PMID: 17710142 PMCID: PMC1937019 DOI: 10.1371/journal.pone.0000742] [Citation(s) in RCA: 77] [Impact Index Per Article: 4.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/14/2007] [Accepted: 07/16/2007] [Indexed: 11/19/2022] Open
Abstract
It is well known that simultaneous presentation of incongruent audio and visual stimuli can lead to illusory percepts. Recent data suggest that distinct processes underlie non-specific intersensory speech as opposed to non-speech perception. However, the development of both speech and non-speech intersensory perception across childhood and adolescence remains poorly defined. Thirty-eight observers aged 5 to 19 were tested on the McGurk effect (an audio-visual illusion involving speech), the Illusory Flash effect and the Fusion effect (two audio-visual illusions not involving speech) to investigate the development of audio-visual interactions and contrast speech vs. non-speech developmental patterns. Whereas the strength of audio-visual speech illusions varied as a direct function of maturational level, performance on non-speech illusory tasks appeared to be homogeneous across all ages. These data support the existence of independent maturational processes underlying speech and non-speech audio-visual illusory effects.
Affiliation(s)
- Corinne Tremblay: Department of Psychology, University of Montreal, Montreal, Canada; Research Center, Sainte-Justine Hospital, Montreal, Canada
- François Champoux: Speech Language Pathology and Audiology, University of Montreal, Montreal, Canada
- Patrice Voss: Department of Psychology, University of Montreal, Montreal, Canada
- Benoit A. Bacon: Department of Psychology, Bishop's University, Sherbrooke, Quebec, Canada
- Franco Lepore: Department of Psychology, University of Montreal, Montreal, Canada; Research Center, Sainte-Justine Hospital, Montreal, Canada
- Hugo Théoret: Department of Psychology, University of Montreal, Montreal, Canada; Research Center, Sainte-Justine Hospital, Montreal, Canada
|
45
|
Fingelkurts AA, Fingelkurts AA, Krause CM. Composition of brain oscillations and their functions in the maintenance of auditory, visual and audio–visual speech percepts: an exploratory study. Cogn Process 2007; 8:183-99. [PMID: 17653780 DOI: 10.1007/s10339-007-0175-x] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/13/2006] [Revised: 05/18/2007] [Accepted: 06/01/2007] [Indexed: 11/30/2022]
Abstract
In the present exploratory study of 7 subjects, we examined the composition of magnetoencephalographic (MEG) brain oscillations induced by the presentation of auditory, visual, and audio-visual stimuli (a talking face) in an oddball paradigm. The composition of brain oscillations was assessed by classifying short-term MEG spectral patterns by probability. The probability of particular brain oscillations being elicited depended on the type and modality of the sensory percept. Maintenance of the integrated audio-visual percept was accompanied by a unique composition of distributed brain oscillations typical of the auditory and visual modalities, with the contribution of oscillations characteristic of the visual modality dominant. Oscillations around 20 Hz were characteristic of the maintenance of the integrated audio-visual percept. Identifying the actual composition of brain oscillations allowed us (1) to distinguish two subjectively/consciously identical mental percepts and (2) to characterize the types of brain functions involved in the maintenance of the multi-sensory percept.
|
46
|
Hamilton RH, Shenton JT, Coslett HB. An acquired deficit of audiovisual speech processing. Brain Lang 2006; 98:66-73. [PMID: 16600357 DOI: 10.1016/j.bandl.2006.02.001] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 12/02/2005] [Revised: 01/24/2006] [Accepted: 02/03/2006] [Indexed: 05/08/2023]
Abstract
We report a 53-year-old patient (AWF) who has an acquired deficit of audiovisual speech integration, characterized by a perceived temporal mismatch between speech sounds and the sight of moving lips. AWF was less accurate on an auditory digit span task with vision of a speaker's face as compared to a condition in which no visual information from the lower face was available. He was slower in matching words to pictures when he saw congruent lip movements compared to no lip movements or non-speech lip movements. Unlike normal controls, he showed no McGurk effect. We propose that multisensory binding of audiovisual language cues can be selectively disrupted.
Affiliation(s)
- Roy H Hamilton: Department of Neurology, University of Pennsylvania, Philadelphia, PA 19104, USA
|
47
|
Irwin JR, Whalen DH, Fowler CA. A sex difference in visual influence on heard speech. Percept Psychophys 2006; 68:582-92. [PMID: 16933423 DOI: 10.3758/bf03208760] [Citation(s) in RCA: 31] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/08/2022]
Abstract
Reports of sex differences in language processing are inconsistent and are thought to vary by task type and difficulty. In two experiments, we investigated a sex difference in visual influence on heard speech (the McGurk effect). First, incongruent consonant-vowel stimuli were presented in which the visual portion of the signal was brief (100 msec) or full (temporally equivalent to the auditory). Second, to determine whether men and women differed in their ability to extract visual speech information from these brief stimuli, the same stimuli were presented to new participants with an additional visual-only (lipread) condition. In both experiments, women showed a significantly greater visual influence on heard speech than did men for the brief visual stimuli. No sex differences were found for the full stimuli or in the ability to lipread. These findings indicate that the more challenging brief visual stimuli elicit sex differences in the processing of audiovisual speech.
Affiliation(s)
- Julia R Irwin: Haskins Laboratories, New Haven, Connecticut 06511, USA
|
48
|
van Wassenhove V, Grant KW, Poeppel D. Temporal window of integration in auditory-visual speech perception. Neuropsychologia 2006; 45:598-607. [PMID: 16530232 DOI: 10.1016/j.neuropsychologia.2006.01.001] [Citation(s) in RCA: 370] [Impact Index Per Article: 20.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/23/2005] [Revised: 11/28/2005] [Accepted: 01/12/2006] [Indexed: 10/24/2022]
Abstract
Forty-three normal-hearing participants were tested in two experiments that focused on temporal coincidence in auditory-visual (AV) speech perception. In these experiments, audio recordings of /pa/ and /ba/ were dubbed onto video recordings of /ka/ or /ga/, respectively (ApVk, AbVg), to produce the illusory "fusion" percepts /ta/ or /da/ [McGurk, H., & MacDonald, J. (1976). Hearing lips and seeing voices. Nature, 264, 746-747]. In Experiment 1, an identification task using McGurk pairs with asynchronies ranging from -467 ms (auditory lead) to +467 ms was conducted. Fusion responses were prevalent over temporal asynchronies from -30 ms to +170 ms and were more robust for audio lags. In Experiment 2, simultaneity judgments for incongruent and congruent audiovisual tokens (AdVd, AtVt) were collected. McGurk pairs were more readily judged as asynchronous than congruent pairs. The characteristics of the temporal window over which simultaneity and fusion responses were maximal were quite similar, suggesting the existence of an asymmetric bimodal temporal integration window roughly 200 ms in duration.
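Characterizing such a window amounts to finding the range of asynchronies over which fusion responses stay near maximal. A sketch on invented fusion rates; the 0.75-of-peak criterion is an arbitrary choice, not the authors'.

```python
import numpy as np

# SOA in ms (negative = audio leads) and hypothetical fusion-response rates.
soa    = np.array([-467, -267, -133, -67, -30, 0, 67, 133, 170, 267, 467])
fusion = np.array([0.10, 0.20, 0.45, 0.62, 0.78, 0.80, 0.79, 0.74, 0.70, 0.35, 0.08])

threshold = 0.75 * fusion.max()          # "near-maximal" criterion (arbitrary)
inside = soa[fusion >= threshold]
print(f"integration window ~ {inside.min()} to {inside.max()} ms")
# Note the asymmetry: the window extends further into audio-lag (positive) SOAs.
```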
|
49
|
Brancazio L, Best CT, Fowler CA. Visual influences on perception of speech and nonspeech vocal-tract events. Lang Speech 2006; 49:21-53. [PMID: 16922061 PMCID: PMC2773261 DOI: 10.1177/00238309060490010301] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/11/2023]
Abstract
We report four experiments designed to determine whether visual information affects judgments of acoustically-specified nonspeech events as well as speech events (the "McGurk effect"). Previous findings have shown only weak McGurk effects for nonspeech stimuli, whereas strong effects are found for consonants. We used click sounds that serve as consonants in some African languages, but that are perceived as nonspeech by American English listeners. We found a significant McGurk effect for clicks presented in isolation that was much smaller than that found for stop-consonant-vowel syllables. In subsequent experiments, we found strong McGurk effects, comparable to those found for English syllables, for click-vowel syllables, and weak effects, comparable to those found for isolated clicks, for excised release bursts of stop consonants presented in isolation. We interpret these findings as evidence that the potential contributions of speech-specific processes on the McGurk effect are limited, and discuss the results in relation to current explanations for the McGurk effect.
Affiliation(s)
- Lawrence Brancazio: Department of Psychology, Southern Connecticut State University, New Haven 06515, USA
|
50
|
Norrix LW, Plante E, Vance R. Auditory-visual speech integration by adults with and without language-learning disabilities. J Commun Disord 2006; 39:22-36. [PMID: 15950983 DOI: 10.1016/j.jcomdis.2005.05.003] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 11/29/2004] [Revised: 04/19/2005] [Accepted: 05/12/2005] [Indexed: 05/02/2023]
Abstract
Auditory and auditory-visual (AV) speech perception skills were examined in adults with and without language-learning disabilities (LLD). The AV stimuli consisted of congruent consonant-vowel syllables (auditory and visual syllables matched in terms of the syllable being produced) and incongruent McGurk syllables (auditory syllable differed from visual syllable). Although identification of the auditory and congruent AV syllables was comparable for the two groups, reaction times to identify all syllables were longer in the LLD group than in the control group. This finding is consistent with previous research demonstrating slower processing in individuals with learning disabilities. Adults with LLD also provided significantly fewer integration-type (McGurk) responses than their normal peers when presented with speech tokens representing a mismatch between the auditory and visual signals. These results suggest that the poor integration of auditory-visual speech previously documented in children with poor language skills also occurs in adults with LLD. LEARNING OUTCOMES The reader will be able to (1) describe the McGurk effect and (2) describe group differences (adults with language-learning disabilities vs. controls) in auditory and auditory-visual speech perception of consonant-vowel syllables.
Affiliation(s)
- Linda W Norrix: Department of Speech and Hearing Sciences, University of Arizona, P.O. Box 210071, Tucson, AZ 85721-0071, USA
|