1
Ahmed F, Nidiffer AR, Lalor EC. The effect of gaze on EEG measures of multisensory integration in a cocktail party scenario. Front Hum Neurosci 2023; 17:1283206. PMID: 38162285; PMCID: PMC10754997; DOI: 10.3389/fnhum.2023.1283206.
Abstract
Seeing the speaker's face greatly improves our speech comprehension in noisy environments. This is due to the brain's ability to combine the auditory and the visual information around us, a process known as multisensory integration. Selective attention also strongly influences what we comprehend in scenarios with multiple speakers-an effect known as the cocktail-party phenomenon. However, the interaction between attention and multisensory integration is not fully understood, especially when it comes to natural, continuous speech. In a recent electroencephalography (EEG) study, we explored this issue and showed that multisensory integration is enhanced when an audiovisual speaker is attended compared to when that speaker is unattended. Here, we extend that work to investigate how this interaction varies depending on a person's gaze behavior, which affects the quality of the visual information they have access to. To do so, we recorded EEG from 31 healthy adults as they performed selective attention tasks in several paradigms involving two concurrently presented audiovisual speakers. We then modeled how the recorded EEG related to the audio speech (envelope) of the presented speakers. Crucially, we compared two classes of model - one that assumed underlying multisensory integration (AV) versus another that assumed two independent unisensory audio and visual processes (A+V). This comparison revealed evidence of strong attentional effects on multisensory integration when participants were looking directly at the face of an audiovisual speaker. This effect was not apparent when the speaker's face was in the peripheral vision of the participants. Overall, our findings suggest a strong influence of attention on multisensory integration when high fidelity visual (articulatory) speech information is available. More generally, this suggests that the interplay between attention and multisensory integration during natural audiovisual speech is dynamic and is adaptable based on the specific task and environment.
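To make the modeling step concrete, the following is a minimal sketch of comparing an integrated audiovisual (AV) encoding model against an additive unisensory (A+V) model using ridge-regression temporal response functions. All signals, lag counts, and regularization values are synthetic placeholders, not the authors' data or pipeline.

```python
# Hedged sketch: AV vs. A+V encoding models for predicting EEG from the speech
# envelope and a visual feature. Synthetic data only; not the authors' pipeline.
import numpy as np
from numpy.linalg import solve

def lagged(x, n_lags):
    """Build a [time x lags] design matrix from a 1-D stimulus feature."""
    X = np.zeros((len(x), n_lags))
    for k in range(n_lags):
        X[k:, k] = x[:len(x) - k]
    return X

def ridge_fit_predict(X_train, y_train, X_test, lam=1.0):
    """Ordinary ridge regression (temporal response function style)."""
    w = solve(X_train.T @ X_train + lam * np.eye(X_train.shape[1]), X_train.T @ y_train)
    return X_test @ w

rng = np.random.default_rng(0)
n = 8000
env = rng.standard_normal(n)          # audio envelope (placeholder)
vis = rng.standard_normal(n)          # visual articulatory feature (placeholder)
eeg = np.convolve(env + 0.5 * vis, rng.standard_normal(16), mode="same") \
      + rng.standard_normal(n)        # synthetic single-channel EEG

half = n // 2
A, V = lagged(env, 16), lagged(vis, 16)
AV = np.hstack([A, V])

# AV model: one model with joint access to both features.
pred_av = ridge_fit_predict(AV[:half], eeg[:half], AV[half:])
# A+V model: independent unisensory models, predictions summed.
pred_a_plus_v = (ridge_fit_predict(A[:half], eeg[:half], A[half:]) +
                 ridge_fit_predict(V[:half], eeg[:half], V[half:]))

for name, pred in [("AV", pred_av), ("A+V", pred_a_plus_v)]:
    r = np.corrcoef(pred, eeg[half:])[0, 1]
    print(f"{name} model prediction r = {r:.3f}")
```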
Affiliation(s)
- Edmund C. Lalor
- Department of Biomedical Engineering, Department of Neuroscience, and Del Monte Institute for Neuroscience, and Center for Visual Science, University of Rochester, Rochester, NY, United States
2
Ahmed F, Nidiffer AR, Lalor EC. The effect of gaze on EEG measures of multisensory integration in a cocktail party scenario. bioRxiv 2023:2023.08.23.554451. PMID: 37662393; PMCID: PMC10473711; DOI: 10.1101/2023.08.23.554451.
Abstract
Seeing the speaker's face greatly improves our speech comprehension in noisy environments. This is due to the brain's ability to combine the auditory and the visual information around us, a process known as multisensory integration. Selective attention also strongly influences what we comprehend in scenarios with multiple speakers - an effect known as the cocktail-party phenomenon. However, the interaction between attention and multisensory integration is not fully understood, especially when it comes to natural, continuous speech. In a recent electroencephalography (EEG) study, we explored this issue and showed that multisensory integration is enhanced when an audiovisual speaker is attended compared to when that speaker is unattended. Here, we extend that work to investigate how this interaction varies depending on a person's gaze behavior, which affects the quality of the visual information they have access to. To do so, we recorded EEG from 31 healthy adults as they performed selective attention tasks in several paradigms involving two concurrently presented audiovisual speakers. We then modeled how the recorded EEG related to the audio speech (envelope) of the presented speakers. Crucially, we compared two classes of model - one that assumed underlying multisensory integration (AV) versus another that assumed two independent unisensory audio and visual processes (A+V). This comparison revealed evidence of strong attentional effects on multisensory integration when participants were looking directly at the face of an audiovisual speaker. This effect was not apparent when the speaker's face was in the peripheral vision of the participants. Overall, our findings suggest a strong influence of attention on multisensory integration when high fidelity visual (articulatory) speech information is available. More generally, this suggests that the interplay between attention and multisensory integration during natural audiovisual speech is dynamic and is adaptable based on the specific task and environment.
Affiliation(s)
- Farhin Ahmed
- Department of Biomedical Engineering, Department of Neuroscience, and Del Monte Institute for Neuroscience, and Center for Visual Science, University of Rochester, Rochester, NY 14627, USA
- Aaron R. Nidiffer
- Department of Biomedical Engineering, Department of Neuroscience, and Del Monte Institute for Neuroscience, and Center for Visual Science, University of Rochester, Rochester, NY 14627, USA
- Edmund C. Lalor
- Department of Biomedical Engineering, Department of Neuroscience, and Del Monte Institute for Neuroscience, and Center for Visual Science, University of Rochester, Rochester, NY 14627, USA
3
Hoffmann AP, Moldwin MB. Separation of Spacecraft Noise From Geomagnetic Field Observations Through Density-Based Cluster Analysis and Compressive Sensing. J Geophys Res Space Phys 2022; 127:e2022JA030757. PMID: 36245706; PMCID: PMC9541872; DOI: 10.1029/2022ja030757.
Abstract
The use of magnetometers for space exploration is inhibited by magnetic noise generated by spacecraft electrical systems. Mechanical booms are traditionally used to extend magnetometers away from noise sources. If a spacecraft is equipped with multiple magnetometers, signal processing algorithms can be used to compare magnetometer measurements and remove stray magnetic noise signals. We propose the use of density-based cluster analysis to identify spacecraft noise signals and compressive sensing to separate spacecraft noise from geomagnetic field data. This method assumes no prior knowledge of the number, location, or amplitude of noise signals, but assumes that they have minimal overlapping spectral properties. We demonstrate the validity of this algorithm by separating high latitude magnetic perturbations recorded by the low-Earth orbiting satellite, SWARM, from noise signals in simulation and in a laboratory experiment using a mock CubeSat apparatus. In the case of more noise sources than magnetometers, this problem is an instance of underdetermined blind source separation (UBSS). This work presents a UBSS signal processing algorithm to remove spacecraft noise and minimize the need for a mechanical boom.
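The density-based clustering step can be illustrated with a toy sketch: spectral peaks that recur across several magnetometer channels are clustered with DBSCAN and treated as spacecraft interference, then notched out. This simplification omits the paper's compressive-sensing reconstruction, and all signals, frequencies, and thresholds below are assumed placeholders.

```python
# Hedged sketch of the density-based clustering idea only (not the paper's
# full UBSS / compressive-sensing pipeline). All signals are synthetic.
import numpy as np
from scipy.signal import find_peaks, iirnotch, filtfilt
from sklearn.cluster import DBSCAN

fs, n = 50.0, 6000                                    # assumed sample rate (Hz) and length
t = np.arange(n) / fs
rng = np.random.default_rng(1)
field = 30.0 * np.sin(2 * np.pi * 0.02 * t)           # slow "geomagnetic" signal (placeholder)
tones = [3.0, 7.5]                                    # narrowband spacecraft interference (Hz)
sensors = [g * field + sum(np.sin(2 * np.pi * f * t) for f in tones)
           + 0.1 * rng.standard_normal(n) for g in (1.0, 0.9, 1.1)]

freqs = np.fft.rfftfreq(n, 1 / fs)
keep = freqs > 1.0                                    # placeholder: treat <1 Hz as geophysical band
peak_freqs = []
for x in sensors:
    spec = np.abs(np.fft.rfft(x - np.mean(x)))
    idx, _ = find_peaks(np.where(keep, spec, 0.0), height=0.2 * spec[keep].max())
    peak_freqs.extend(freqs[idx])
peak_freqs = np.array(peak_freqs).reshape(-1, 1)

# Density-based clustering: lines that recur across sensors form dense clusters.
labels = DBSCAN(eps=0.1, min_samples=3).fit_predict(peak_freqs)
interference = [float(peak_freqs[labels == k].mean()) for k in sorted(set(labels)) if k != -1]

cleaned = np.asarray(sensors[0], dtype=float)
for f0 in interference:                               # notch out each flagged line
    b, a = iirnotch(w0=f0, Q=30.0, fs=fs)
    cleaned = filtfilt(b, a, cleaned)
print("flagged interference frequencies (Hz):", np.round(interference, 2))
```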
Affiliation(s)
- Alex Paul Hoffmann
- Climate and Space Sciences and Engineering, University of Michigan, Ann Arbor, MI, USA
- Mark B. Moldwin
- Climate and Space Sciences and Engineering, University of Michigan, Ann Arbor, MI, USA
4
Zhou H, Wang N, Zheng N, Yu G, Meng Q. A New Approach for Noise Suppression in Cochlear Implants: A Single-Channel Noise Reduction Algorithm. Front Neurosci 2020; 14:301. PMID: 32372902; PMCID: PMC7186595; DOI: 10.3389/fnins.2020.00301.
Abstract
The cochlea “translates” the in-air vibrational acoustic “language” into the spikes of neural “language” that are then transmitted to the brain for auditory understanding and/or perception. During this intracochlear “translation” process, high resolution in time–frequency–intensity domains guarantees the high quality of the input neural information for the brain, which is vital for our outstanding hearing abilities. However, cochlear implants (CIs) have coarse artificial coding and interfaces, and CI users experience more challenges in common acoustic environments than their normal-hearing (NH) peers. Noise from sound sources that a listener has no interest in may be neglected by NH listeners, but it may distract a CI user. We discuss CI noise-suppression techniques and introduce noise management for a new implant system. The monaural signal-to-noise ratio estimation-based noise suppression algorithm “eVoice,” which is incorporated in the processors of Nurotron® Enduro™, was evaluated in two speech perception experiments. The results show that speech intelligibility in stationary speech-shaped noise can be significantly improved with eVoice. Similar results have been observed in other CI devices with single-channel noise reduction techniques. Specifically, the mean speech reception threshold decrease in the present study was 2.2 dB. The Nurotron society already has more than 10,000 users, and eVoice is a start for noise management in the new system. Future steps on non-stationary-noise suppression, spatial-source separation, bilateral hearing, microphone configuration, and environment specification are warranted. The existing evidence, including our research, suggests that noise-suppression techniques should be applied in CI systems. The artificial hearing of CI listeners requires more advanced signal processing techniques to reduce brain effort and increase intelligibility in noisy settings.
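As an illustration of this class of algorithm, below is a minimal single-channel, SNR-estimation-based suppressor using a Wiener-style spectral gain. It is a generic sketch with assumed parameters and synthetic signals, not the proprietary eVoice implementation.

```python
# Hedged sketch of a generic single-channel, SNR-estimation-based noise
# suppressor (Wiener-style spectral gain); not the eVoice algorithm.
import numpy as np
from scipy.signal import stft, istft

def snr_gain_suppress(noisy, fs, noise_seconds=0.5, nperseg=256):
    """Attenuate time-frequency bins whose estimated SNR is low."""
    _, _, X = stft(noisy, fs=fs, nperseg=nperseg)
    hop = nperseg // 2                                   # default 50% overlap
    n_noise_frames = max(1, int(noise_seconds * fs / hop))
    # Noise power estimated from an assumed speech-free lead-in segment.
    noise_psd = np.mean(np.abs(X[:, :n_noise_frames]) ** 2, axis=1, keepdims=True)
    snr = np.maximum(np.abs(X) ** 2 / (noise_psd + 1e-12) - 1.0, 0.0)
    gain = snr / (snr + 1.0)                             # Wiener-style gain rule
    _, enhanced = istft(gain * X, fs=fs, nperseg=nperseg)
    return enhanced

fs = 16000
rng = np.random.default_rng(2)
tone = np.sin(2 * np.pi * 220 * np.arange(fs) / fs)      # placeholder "speech"
noisy = np.concatenate([np.zeros(fs // 2), tone]) + 0.3 * rng.standard_normal(fs + fs // 2)
enhanced = snr_gain_suppress(noisy, fs)
```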
Affiliation(s)
- Huali Zhou
- Acoustics Lab, School of Physics and Optoelectronics, South China University of Technology, Guangzhou, China
- Nengheng Zheng
- The Guangdong Key Laboratory of Intelligent Information Processing, College of Electronics and Information Engineering, Shenzhen University, Shenzhen, China
- Guangzheng Yu
- Acoustics Lab, School of Physics and Optoelectronics, South China University of Technology, Guangzhou, China
- Qinglin Meng
- Acoustics Lab, School of Physics and Optoelectronics, South China University of Technology, Guangzhou, China
5
Rennies J, Best V, Roverud E, Kidd G. Energetic and Informational Components of Speech-on-Speech Masking in Binaural Speech Intelligibility and Perceived Listening Effort. Trends Hear 2019; 23:2331216519854597. PMID: 31172880; PMCID: PMC6557024; DOI: 10.1177/2331216519854597.
Abstract
Speech perception in complex sound fields can greatly benefit from different unmasking cues to segregate the target from interfering voices. This study investigated the role of three unmasking cues (spatial separation, gender differences, and masker time reversal) on speech intelligibility and perceived listening effort in normal-hearing listeners. Speech intelligibility and categorically scaled listening effort were measured for a female target talker masked by two competing talkers with no unmasking cues or one to three unmasking cues. In addition to natural stimuli, all measurements were also conducted with glimpsed speech—which was created by removing the time–frequency tiles of the speech mixture in which the maskers dominated the mixture—to estimate the relative amounts of informational and energetic masking as well as the effort associated with source segregation. The results showed that all unmasking cues as well as glimpsing improved intelligibility and reduced listening effort and that providing more than one cue was beneficial in overcoming informational masking. The reduction in listening effort due to glimpsing corresponded to increases in signal-to-noise ratio of 8 to 18 dB, indicating that a significant amount of listening effort was devoted to segregating the target from the maskers. Furthermore, the benefit in listening effort for all unmasking cues extended well into the range of positive signal-to-noise ratios at which speech intelligibility was at ceiling, suggesting that listening effort is a useful tool for evaluating speech-on-speech masking conditions at typical conversational levels.
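The "glimpsing" manipulation can be sketched as an ideal binary mask: time-frequency tiles where the target-to-masker ratio falls below a threshold are discarded before resynthesis. The signals, window length, and threshold below are placeholders, not the study's stimuli or processing.

```python
# Hedged sketch of glimpsed speech via an ideal binary mask; synthetic signals.
import numpy as np
from scipy.signal import stft, istft

def glimpsed(target, maskers, fs, threshold_db=0.0, nperseg=512):
    """Keep only tiles where the target-to-masker ratio exceeds the threshold."""
    _, _, T = stft(target, fs=fs, nperseg=nperseg)
    _, _, M = stft(maskers, fs=fs, nperseg=nperseg)
    _, _, MIX = stft(target + maskers, fs=fs, nperseg=nperseg)
    tmr_db = 20 * np.log10((np.abs(T) + 1e-12) / (np.abs(M) + 1e-12))
    mask = tmr_db > threshold_db
    _, out = istft(MIX * mask, fs=fs, nperseg=nperseg)
    return out

fs = 16000
t = np.arange(2 * fs) / fs
target = np.sin(2 * np.pi * 200 * t) * (1 + np.sin(2 * np.pi * 4 * t))   # placeholder talker
maskers = 0.8 * np.sin(2 * np.pi * 310 * t) * (1 + np.sin(2 * np.pi * 6 * t))
out = glimpsed(target, maskers, fs)
```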
Affiliation(s)
- Jan Rennies
- Department of Speech, Language and Hearing Sciences, Boston University, MA, USA; Fraunhofer Institute for Digital Media Technology IDMT, Project Group Hearing, Speech and Audio Technology, Oldenburg, Germany; Cluster of Excellence Hearing4all, Carl-von-Ossietzky University, Oldenburg, Germany
- Virginia Best
- Department of Speech, Language and Hearing Sciences, Boston University, MA, USA
- Elin Roverud
- Department of Speech, Language and Hearing Sciences, Boston University, MA, USA
- Gerald Kidd
- Department of Speech, Language and Hearing Sciences, Boston University, MA, USA
6
Abstract
Events and objects in the world must be inferred from sensory signals to support behavior. Because sensory measurements are temporally and spatially local, the estimation of an object or event can be viewed as the grouping of these measurements into representations of their common causes. Perceptual grouping is believed to reflect internalized regularities of the natural environment, yet grouping cues have traditionally been identified using informal observation and investigated using artificial stimuli. The relationship of grouping to natural signal statistics has thus remained unclear, and additional or alternative cues remain possible. Here, we develop a general methodology for relating grouping to natural sensory signals and apply it to derive auditory grouping cues from natural sounds. We first learned local spectrotemporal features from natural sounds and measured their co-occurrence statistics. We then learned a small set of stimulus properties that could predict the measured feature co-occurrences. The resulting cues included established grouping cues, such as harmonic frequency relationships and temporal coincidence, but also revealed previously unappreciated grouping principles. Human perceptual grouping was predicted by natural feature co-occurrence, with humans relying on the derived grouping cues in proportion to their informativity about co-occurrence in natural sounds. The results suggest that auditory grouping is adapted to natural stimulus statistics, show how these statistics can reveal previously unappreciated grouping phenomena, and provide a framework for studying grouping in natural signals.
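A rough illustration of the approach follows, with NMF standing in for the paper's feature learning and synthetic audio in place of natural sounds: learn a few spectral features from a spectrogram and quantify how strongly their activation time courses co-occur.

```python
# Hedged sketch: learn spectral features (NMF as a stand-in) and measure their
# co-occurrence over time. Purely illustrative; the paper's features,
# statistics, and cue derivation differ.
import numpy as np
from scipy.signal import stft
from sklearn.decomposition import NMF

fs = 8000
rng = np.random.default_rng(3)
t = np.arange(4 * fs) / fs
# Placeholder "natural" sound: harmonically related tones turning on and off
# together, plus an unrelated tone with its own on/off pattern.
env1, env2 = (np.sin(2 * np.pi * 0.5 * t) > 0), (np.sin(2 * np.pi * 0.3 * t) > 0)
sound = env1 * (np.sin(2 * np.pi * 200 * t) + np.sin(2 * np.pi * 400 * t)) \
      + env2 * np.sin(2 * np.pi * 700 * t) + 0.01 * rng.standard_normal(len(t))

_, _, S = stft(sound, fs=fs, nperseg=256)
spec = np.abs(S)

# Learn a handful of spectral features and their activation time courses.
nmf = NMF(n_components=4, init="nndsvd", max_iter=500, random_state=0)
activations = nmf.fit_transform(spec.T)          # [time x features]

# Co-occurrence statistic: correlation between feature activation time courses.
co_occurrence = np.corrcoef(activations.T)
print(np.round(co_occurrence, 2))
```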
7
Chou KF, Dong J, Colburn HS, Sen K. A Physiologically Inspired Model for Solving the Cocktail Party Problem. J Assoc Res Otolaryngol 2019; 20:579-93. PMID: 31392449; DOI: 10.1007/s10162-019-00732-4.
Abstract
At a cocktail party, we can broadly monitor the entire acoustic scene to detect important cues (e.g., our names being called, or the fire alarm going off), or selectively listen to a target sound source (e.g., a conversation partner). It has recently been observed that individual neurons in the avian field L (analog to the mammalian auditory cortex) can display broad spatial tuning to single targets and selective tuning to a target embedded in spatially distributed sound mixtures. Here, we describe a model inspired by these experimental observations and apply it to process mixtures of human speech sentences. This processing is realized in the neural spiking domain. It converts binaural acoustic inputs into cortical spike trains using a multi-stage model composed of a cochlear filter-bank, a midbrain spatial-localization network, and a cortical network. The output spike trains of the cortical network are then converted back into an acoustic waveform, using a stimulus reconstruction technique. The intelligibility of the reconstructed output is quantified using an objective measure of speech intelligibility. We apply the algorithm to single and multi-talker speech to demonstrate that the physiologically inspired algorithm is able to achieve intelligible reconstruction of an “attended” target sentence embedded in two other non-attended masker sentences. The algorithm is also robust to masker level and displays performance trends comparable to humans. The ideas from this work may help improve the performance of hearing assistive devices (e.g., hearing aids and cochlear implants), speech-recognition technology, and computational algorithms for processing natural scenes cluttered with spatially distributed acoustic objects.
8
Puvvada KC, Simon JZ. Cortical Representations of Speech in a Multitalker Auditory Scene. J Neurosci 2017; 37:9189-96. PMID: 28821680; DOI: 10.1523/JNEUROSCI.0938-17.2017.
Abstract
The ability to parse a complex auditory scene into perceptual objects is facilitated by a hierarchical auditory system. Successive stages in the hierarchy transform an auditory scene of multiple overlapping sources, from peripheral tonotopically based representations in the auditory nerve, into perceptually distinct auditory-object-based representations in the auditory cortex. Here, using magnetoencephalography recordings from men and women, we investigate how a complex acoustic scene consisting of multiple speech sources is represented in distinct hierarchical stages of the auditory cortex. Using systems-theoretic methods of stimulus reconstruction, we show that the primary-like areas in the auditory cortex contain dominantly spectrotemporal-based representations of the entire auditory scene. Here, both attended and ignored speech streams are represented with almost equal fidelity, and a global representation of the full auditory scene with all its streams is a better candidate neural representation than that of individual streams being represented separately. We also show that higher-order auditory cortical areas, by contrast, represent the attended stream separately and with significantly higher fidelity than unattended streams. Furthermore, the unattended background streams are more faithfully represented as a single unsegregated background object rather than as separated objects. Together, these findings demonstrate the progression of the representations and processing of a complex acoustic scene up through the hierarchy of the human auditory cortex.
SIGNIFICANCE STATEMENT: Using magnetoencephalography recordings from human listeners in a simulated cocktail party environment, we investigate how a complex acoustic scene consisting of multiple speech sources is represented in separate hierarchical stages of the auditory cortex. We show that the primary-like areas in the auditory cortex use a dominantly spectrotemporal-based representation of the entire auditory scene, with both attended and unattended speech streams represented with almost equal fidelity. We also show that higher-order auditory cortical areas, by contrast, represent an attended speech stream separately from, and with significantly higher fidelity than, unattended speech streams. Furthermore, the unattended background streams are represented as a single undivided background object rather than as distinct background objects.
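The stimulus-reconstruction idea in its simplest linear form, a regularized backward model mapping multichannel neural data to a speech envelope, can be sketched as follows; the data are synthetic and the details differ from the authors' MEG analysis.

```python
# Hedged sketch of linear stimulus reconstruction (backward model) from
# multichannel neural data; synthetic data, not the authors' MEG pipeline.
import numpy as np

rng = np.random.default_rng(4)
n_time, n_chan, n_lags = 6000, 32, 20
env = rng.standard_normal(n_time)                       # attended-speech envelope
mixing = rng.standard_normal(n_chan)
neural = np.outer(env, mixing) + 2.0 * rng.standard_normal((n_time, n_chan))

# Design matrix: current and lagged copies of every channel
# (np.roll wrap-around is ignored here for brevity).
X = np.hstack([np.roll(neural, k, axis=0) for k in range(n_lags)])
half = n_time // 2
lam = 1e2                                               # assumed ridge parameter
w = np.linalg.solve(X[:half].T @ X[:half] + lam * np.eye(X.shape[1]),
                    X[:half].T @ env[:half])
recon = X[half:] @ w
print("reconstruction accuracy r =",
      round(float(np.corrcoef(recon, env[half:])[0, 1]), 3))
```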
9
Abstract
Noninvasive EEG (electroencephalography)-based auditory attention detection could be useful for improved hearing aids in the future. This work is a novel attempt to investigate the feasibility of online modulation of sound sources by probabilistic detection of auditory attention, using a noninvasive EEG-based brain-computer interface. The proposed online system modulates the upcoming sound sources through gain adaptation, which employs probabilistic decisions (soft decisions) from a classifier trained on offline calibration data. In this work, calibration EEG data were collected in sessions where the participants listened to two sound sources (one attended and one unattended). Cross-correlation coefficients between the EEG measurements and the attended and unattended sound-source envelopes (estimates) are used to show differences in the sharpness and delays of neural responses to attended versus unattended sound sources. Salient features in the correlation patterns that distinguish attended sources from unattended ones were identified and later used to train an auditory attention classifier. Using this classifier, we show high offline detection performance with single-channel EEG measurements compared to existing approaches in the literature, which employ a large number of channels. In addition, using the classifier trained offline in the calibration session, we show the performance of the online sound source modulation system. We observe that the online system is able to keep the level of the attended sound source higher than that of the unattended source.
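A minimal sketch of the correlation-feature-plus-classifier idea follows: cross-correlate EEG with each talker's envelope over a range of lags and train a probabilistic classifier whose soft decisions could drive gain adaptation. The data, lags, and classifier choice (logistic regression) are assumptions, not the paper's calibration or online system.

```python
# Hedged sketch: lagged EEG-envelope correlations as features for a
# probabilistic attention classifier. Synthetic data only.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
n_trials, n_time, max_lag = 120, 640, 16

def xcorr_features(eeg, env, max_lag):
    """Correlation between EEG and an envelope at lags 0..max_lag-1."""
    return np.array([np.corrcoef(eeg[lag:], env[:len(env) - lag])[0, 1]
                     for lag in range(max_lag)])

X, y = [], []
for _ in range(n_trials):
    env_att = rng.standard_normal(n_time)                # attended envelope
    env_unatt = rng.standard_normal(n_time)              # unattended envelope
    eeg = np.roll(env_att, 6) + 1.5 * rng.standard_normal(n_time)  # lagged, noisy response
    for env, label in [(env_att, 1), (env_unatt, 0)]:
        X.append(xcorr_features(eeg, env, max_lag))
        y.append(label)
X, y = np.array(X), np.array(y)

clf = LogisticRegression(max_iter=1000).fit(X[:160], y[:160])
# Soft decisions (probabilities) could drive per-source gain adaptation.
print("held-out P(attended):", np.round(clf.predict_proba(X[160:])[:, 1], 2))
print("held-out accuracy:", clf.score(X[160:], y[160:]))
```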
Affiliation(s)
- Murat Akcakaya
- University of Pittsburgh, 4200 Fifth Ave, Pittsburgh, PA 15260
- Deniz Erdogmus
- Northeastern University, 360 Huntington Ave, Boston, MA 02115
10
Lee N, Ward JL, Vélez A, Micheyl C, Bee MA. Frogs Exploit Statistical Regularities in Noisy Acoustic Scenes to Solve Cocktail-Party-like Problems. Curr Biol 2017; 27:743-750. PMID: 28238657; DOI: 10.1016/j.cub.2017.01.031.
Abstract
Noise is a ubiquitous source of errors in all forms of communication [1]. Noise-induced errors in speech communication, for example, make it difficult for humans to converse in noisy social settings, a challenge aptly named the "cocktail party problem" [2]. Many nonhuman animals also communicate acoustically in noisy social groups and thus face biologically analogous problems [3]. However, we know little about how the perceptual systems of receivers are evolutionarily adapted to avoid the costs of noise-induced errors in communication. In this study of Cope's gray treefrog (Hyla chrysoscelis; Hylidae), we investigated whether receivers exploit a potential statistical regularity present in noisy acoustic scenes to reduce errors in signal recognition and discrimination. We developed an anatomical/physiological model of the peripheral auditory system to show that temporal correlation in amplitude fluctuations across the frequency spectrum ("comodulation") [4-6] is a feature of the noise generated by large breeding choruses of sexually advertising males. In four psychophysical experiments, we investigated whether females exploit comodulation in background noise to mitigate noise-induced errors in evolutionarily critical mate-choice decisions. Subjects experienced fewer errors in recognizing conspecific calls and in selecting the calls of high-quality mates in the presence of simulated chorus noise that was comodulated. These data show unequivocally, and for the first time, that exploiting statistical regularities present in noisy acoustic scenes is an important biological strategy for solving cocktail-party-like problems in nonhuman animal communication.
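The stimulus property at issue, comodulation, can be illustrated by synthesizing noise whose frequency bands share one slow amplitude envelope versus noise whose bands fluctuate independently; the band edges and modulation rate below are placeholders, not the actual psychophysical stimuli.

```python
# Hedged sketch: comodulated vs. independently modulated multi-band noise.
import numpy as np
from scipy.signal import butter, sosfiltfilt

def band_noise(fs, dur, lo, hi, rng):
    """Band-limited Gaussian noise between lo and hi Hz."""
    sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
    return sosfiltfilt(sos, rng.standard_normal(int(fs * dur)))

def slow_envelope(fs, dur, rate_hz, rng):
    """A slowly fluctuating, non-negative amplitude envelope."""
    sos = butter(2, rate_hz, btype="lowpass", fs=fs, output="sos")
    env = sosfiltfilt(sos, rng.standard_normal(int(fs * dur)))
    return np.abs(env) / np.max(np.abs(env))

fs, dur = 16000, 2.0
rng = np.random.default_rng(6)
bands = [(500, 900), (1100, 1500), (1700, 2100)]        # placeholder bands

shared_env = slow_envelope(fs, dur, 10, rng)
comodulated = sum(band_noise(fs, dur, lo, hi, rng) * shared_env for lo, hi in bands)
uncomodulated = sum(band_noise(fs, dur, lo, hi, rng) * slow_envelope(fs, dur, 10, rng)
                    for lo, hi in bands)
```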
Affiliation(s)
- Norman Lee
- Department of Ecology, Evolution, and Behavior, University of Minnesota, Saint Paul, MN 55108, USA.
- Jessica L Ward
- Department of Ecology, Evolution, and Behavior, University of Minnesota, Saint Paul, MN 55108, USA; Department of Fisheries, Wildlife, and Conservation Biology, University of Minnesota, Saint Paul, MN 55108, USA
- Alejandro Vélez
- Department of Ecology, Evolution, and Behavior, University of Minnesota, Saint Paul, MN 55108, USA
- Christophe Micheyl
- Department of Psychology, University of Minnesota, Minneapolis, MN 55455, USA
- Mark A Bee
- Department of Ecology, Evolution, and Behavior, University of Minnesota, Saint Paul, MN 55108, USA; Graduate Program in Neuroscience, University of Minnesota, Minneapolis, MN 55455, USA
11
Caldwell MS, Lee N, Bee MA. Inherent Directionality Determines Spatial Release from Masking at the Tympanum in a Vertebrate with Internally Coupled Ears. J Assoc Res Otolaryngol 2016; 17:259-70. PMID: 27125545; DOI: 10.1007/s10162-016-0568-6.
Abstract
In contrast to humans and other mammals, many animals have internally coupled ears that function as inherently directional pressure-gradient receivers. Two important but unanswered questions are to what extent and how do animals with such ears exploit spatial cues in the perceptual analysis of noisy and complex acoustic scenes? This study of Cope's gray treefrog (Hyla chrysoscelis) investigated how the inherent directionality of internally coupled ears contributes to spatial release from masking. We used laser vibrometry and signal detection theory to determine the threshold signal-to-noise ratio at which the tympanum's response to vocalizations could be reliably detected in noise. Thresholds were determined as a function of signal location, noise location, and signal-noise separation. Vocalizations were broadcast from one of three azimuthal locations: frontal (0°), to the right (+90°), and to the left (-90°). Masking noise was broadcast from each of 12 azimuthal angles around the frog (0° to 330°, 30° separation). Variation in the position of the noise source resulted in, on average, 4 dB of spatial release from masking relative to co-located conditions. However, detection thresholds could be up to 9 dB lower in the "best ear for listening" compared to the other ear. The pattern and magnitude of spatial release from masking were well predicted by the tympanum's inherent directionality. We discuss how the magnitude of masking release observed in the tympanum's response to spatially separated signals and noise relates to that observed in previous behavioral and neurophysiological studies of frog hearing and communication.
12
Dong J, Colburn HS, Sen K. Cortical Transformation of Spatial Processing for Solving the Cocktail Party Problem: A Computational Model. eNeuro 2016; 3:ENEURO.0086-15.2015. PMID: 26866056; DOI: 10.1523/ENEURO.0086-15.2015.
Abstract
In multisource, “cocktail party” sound environments, human and animal auditory systems can use spatial cues to effectively separate and follow one source of sound over competing sources. While mechanisms to extract spatial cues such as interaural time differences (ITDs) are well understood in precortical areas, how such information is reused and transformed in higher cortical regions to represent segregated sound sources is not clear. We present a computational model describing a hypothesized neural network that spans spatial cue detection areas and the cortex. This network is based on recent physiological findings that cortical neurons selectively encode target stimuli in the presence of competing maskers based on source locations (Maddox et al., 2012). We demonstrate that key features of cortical responses can be generated by the model network, which exploits spatial interactions between inputs via lateral inhibition, enabling the spatial separation of target and interfering sources while allowing monitoring of a broader acoustic space when there is no competition. We present the model network along with testable experimental paradigms as a starting point for understanding the transformation and organization of spatial information from midbrain to cortex. This network is then extended to suggest engineering solutions that may be useful for hearing-assistive devices in solving the cocktail party problem.
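The core network motif, spatially tuned channels interacting through lateral inhibition, can be sketched as a toy rate model (not the paper's spiking implementation); the inhibition strength and input patterns below are assumed values.

```python
# Hedged sketch: rectified-linear rate network with lateral inhibition across
# spatial channels. Toy model only; not the paper's spiking network.
import numpy as np

n_loc = 7                                                   # spatial channels (azimuth bins)
w_inhib = 0.6                                               # lateral inhibition strength (assumed)
W = -w_inhib * (np.ones((n_loc, n_loc)) - np.eye(n_loc))    # each channel inhibits the others

def steady_state_response(drive, W, steps=200, dt=0.05):
    """Iterate the rate network toward an approximate steady state."""
    r = np.zeros_like(drive)
    for _ in range(steps):
        r = r + dt * (-r + np.maximum(drive + W @ r, 0.0))
    return r

target_only = np.array([0, 0, 0, 1.0, 0, 0, 0])             # single frontal source
target_plus_masker = np.array([0, 0.9, 0, 1.0, 0, 0, 0])    # competing source added

print("single source :", np.round(steady_state_response(target_only, W), 2))
print("with masker   :", np.round(steady_state_response(target_plus_masker, W), 2))
```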
13
Thakur CS, Wang RM, Afshar S, Hamilton TJ, Tapson JC, Shamma SA, van Schaik A. Sound stream segregation: a neuromorphic approach to solve the "cocktail party problem" in real-time. Front Neurosci 2015; 9:309. PMID: 26388721; PMCID: PMC4557082; DOI: 10.3389/fnins.2015.00309.
Abstract
The human auditory system has the ability to segregate complex auditory scenes into a foreground component and a background, allowing us to listen to specific speech sounds from a mixture of sounds. Selective attention plays a crucial role in this process, colloquially known as the "cocktail party effect." It has not been possible to build a machine that can emulate this human ability in real-time. Here, we have developed a framework for the implementation of a neuromorphic sound segregation algorithm in a Field Programmable Gate Array (FPGA). This algorithm is based on the principles of temporal coherence and uses an attention signal to separate a target sound stream from background noise. Temporal coherence implies that auditory features belonging to the same sound source are coherently modulated and evoke highly correlated neural response patterns. The basis for this form of sound segregation is that responses from pairs of channels that are strongly positively correlated belong to the same stream, while channels that are uncorrelated or anti-correlated belong to different streams. In our framework, we have used a neuromorphic cochlea as a frontend sound analyser to extract spatial information of the sound input, which then passes through band pass filters that extract the sound envelope at various modulation rates. Further stages include feature extraction and mask generation, which is finally used to reconstruct the targeted sound. Using sample tonal and speech mixtures, we show that our FPGA architecture is able to segregate sound sources in real-time. The accuracy of segregation is indicated by the high signal-to-noise ratio (SNR) of the segregated stream (90, 77, and 55 dB for simple tone, complex tone, and speech, respectively) as compared to the SNR of the mixture waveform (0 dB). This system may be easily extended for the segregation of complex speech signals, and may thus find various applications in electronic devices such as for sound segregation and speech recognition.
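A software-only sketch of the temporal-coherence principle (not the FPGA/neuromorphic implementation): filterbank channels whose envelopes correlate strongly with an attended anchor channel are kept as foreground and the rest attenuated. Filter centers, the anchor choice, and the coherence threshold are assumptions.

```python
# Hedged sketch: temporal-coherence grouping with a simple filterbank.
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

fs = 16000
t = np.arange(2 * fs) / fs
target = np.sin(2 * np.pi * 3 * t) ** 2                     # 3 Hz modulation
distract = np.sin(2 * np.pi * 5 * t + 1.0) ** 2             # 5 Hz modulation
mixture = target * (np.sin(2*np.pi*300*t) + np.sin(2*np.pi*600*t)) \
        + distract * np.sin(2*np.pi*1000*t)

centers = [300, 600, 1000]                                  # assumed filterbank centers
bands, envs = [], []
for fc in centers:
    sos = butter(4, [0.8 * fc, 1.2 * fc], btype="bandpass", fs=fs, output="sos")
    band = sosfiltfilt(sos, mixture)
    bands.append(band)
    envs.append(np.abs(hilbert(band)))                      # channel envelope

# Attend to the first channel; keep channels whose envelopes cohere with it.
anchor = 0
corr = [np.corrcoef(envs[anchor], e)[0, 1] for e in envs]
mask = [1.0 if c > 0.5 else 0.0 for c in corr]              # assumed coherence threshold
foreground = sum(m * b for m, b in zip(mask, bands))
print("envelope correlations with attended channel:", np.round(corr, 2))
```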
Affiliation(s)
- Chetan Singh Thakur
- Biomedical Engineering and Neuroscience, The MARCS Institute, University of Western Sydney, Sydney, NSW, Australia
- Runchun M. Wang
- Biomedical Engineering and Neuroscience, The MARCS Institute, University of Western Sydney, Sydney, NSW, Australia
- Saeed Afshar
- Biomedical Engineering and Neuroscience, The MARCS Institute, University of Western Sydney, Sydney, NSW, Australia
- Tara J. Hamilton
- Biomedical Engineering and Neuroscience, The MARCS Institute, University of Western Sydney, Sydney, NSW, Australia
- Jonathan C. Tapson
- Biomedical Engineering and Neuroscience, The MARCS Institute, University of Western Sydney, Sydney, NSW, Australia
- Shihab A. Shamma
- Department of Electrical and Computer Engineering and Institute for Systems Research, University of Maryland, College Park, MD, USA
- André van Schaik
- Biomedical Engineering and Neuroscience, The MARCS Institute, University of Western Sydney, Sydney, NSW, Australia
14
Xie Y, Tsai TH, Konneker A, Popa BI, Brady DJ, Cummer SA. Single-sensor multispeaker listening with acoustic metamaterials. Proc Natl Acad Sci U S A 2015; 112:10595-8. PMID: 26261314; DOI: 10.1073/pnas.1502276112.
Abstract
Designing a "cocktail party listener" that functionally mimics the selective perception of a human auditory system has been pursued over the past decades. By exploiting acoustic metamaterials and compressive sensing, we present here a single-sensor listening device that separates simultaneous overlapping sounds from different sources. The device with a compact array of resonant metamaterials is demonstrated to distinguish three overlapping and independent sources with 96.67% correct audio recognition. Segregation of the audio signals is achieved using physical layer encoding without relying on source characteristics. This hardware approach to multichannel source separation can be applied to robust speech recognition and hearing aids and may be extended to other acoustic imaging and sensing applications.
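The physical-layer-encoding idea can be sketched in software: one sensor records a sum of sources, each passed through a known source-specific filter (standing in for the metamaterial channels), and sparse recovery separates them. The filters, sparsity levels, and Lasso solver are toy assumptions, not the device's measured responses or reconstruction method.

```python
# Hedged sketch: single-sensor separation via known per-source filters and
# sparse recovery. Toy placeholders throughout.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(8)
n, k, n_src = 400, 64, 3

def conv_matrix(h, n):
    """Dense matrix implementing convolution with filter h (length-n output)."""
    M = np.zeros((n, n))
    for i in range(n):
        j = slice(i, min(i + len(h), n))
        M[j, i] = h[:j.stop - i]
    return M

filters = [rng.standard_normal(k) for _ in range(n_src)]    # known source-specific channels
A = np.hstack([conv_matrix(h, n) for h in filters])          # sensing dictionary

# Sparse (spiky) sources, as assumed by the sparse-recovery formulation.
sources = np.zeros((n_src, n))
for s in sources:
    s[rng.choice(n, size=8, replace=False)] = rng.standard_normal(8)
measurement = A @ sources.reshape(-1) + 0.01 * rng.standard_normal(n)

# Recover all sources jointly from the single-sensor measurement.
est = Lasso(alpha=0.01, max_iter=10000).fit(A, measurement).coef_.reshape(n_src, n)
for i in range(n_src):
    r = np.corrcoef(est[i], sources[i])[0, 1]
    print(f"source {i}: recovery correlation r = {r:.2f}")
```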
15
Kondo HM, Toshima I, Pressnitzer D, Kashino M. Probing the time course of head-motion cues integration during auditory scene analysis. Front Neurosci 2014; 8:170. PMID: 25009456; PMCID: PMC4067593; DOI: 10.3389/fnins.2014.00170.
Abstract
The perceptual organization of auditory scenes is a hard but important problem to solve for human listeners. It is thus likely that cues from several modalities are pooled for auditory scene analysis, including sensory-motor cues related to the active exploration of the scene. We previously reported a strong effect of head motion on auditory streaming. Streaming refers to an experimental paradigm where listeners hear sequences of pure tones, and rate their perception of one or more subjective sources called streams. To disentangle the effects of head motion (changes in acoustic cues at the ear, subjective location cues, and motor cues), we used a robotic telepresence system, Telehead. We found that head motion induced perceptual reorganization even when the acoustic scene had not changed. Here we reanalyzed the same data to probe the time course of sensory-motor integration. We show that motor cues had a different time course compared to acoustic or subjective location cues: motor cues impacted perceptual organization earlier and for a shorter time than other cues, with successive positive and negative contributions to streaming. An additional experiment controlled for the effects of volitional anticipatory components, and found that arm or leg movements did not have any impact on scene analysis. These data provide a first investigation of the time course of the complex integration of sensory-motor cues in an auditory scene analysis task, and they suggest a loose temporal coupling between the different mechanisms involved.
Affiliation(s)
- Hirohito M Kondo
- NTT Communication Science Laboratories, NTT Corporation, Atsugi, Japan; Department of Child Development, United Graduate School of Child Development, Osaka University, Kanazawa University, Hamamatsu University School of Medicine, Chiba University, and University of Fukui, Suita, Japan
- Iwaki Toshima
- NTT Communication Science Laboratories, NTT Corporation, Atsugi, Japan
- Daniel Pressnitzer
- Laboratoire des Systèmes Perceptifs, CNRS UMR 8248, Paris, France; Département d'études cognitives, École normale supérieure, Paris, France
- Makio Kashino
- NTT Communication Science Laboratories, NTT Corporation, Atsugi, Japan; Department of Information Processing, Interdisciplinary Graduate School of Science and Engineering, Tokyo Institute of Technology, Yokohama, Japan
16
Abstract
Auditory cortical activity is entrained to the temporal envelope of speech, which corresponds to the syllabic rhythm of speech. Such entrained cortical activity can be measured from subjects naturally listening to sentences or spoken passages, providing a reliable neural marker of online speech processing. A central question still remains to be answered about whether cortical entrained activity is more closely related to speech perception or non-speech-specific auditory encoding. Here, we review a few hypotheses about the functional roles of cortical entrainment to speech, e.g., encoding acoustic features, parsing syllabic boundaries, and selecting sensory information in complex listening environments. It is likely that speech entrainment is not a homogeneous response and these hypotheses apply separately for speech entrainment generated from different neural sources. The relationship between entrained activity and speech intelligibility is also discussed. A tentative conclusion is that theta-band entrainment (4–8 Hz) encodes speech features critical for intelligibility while delta-band entrainment (1–4 Hz) is related to the perceived, non-speech-specific acoustic rhythm. To further understand the functional properties of speech entrainment, a splitter’s approach will be needed to investigate (1) not just the temporal envelope but what specific acoustic features are encoded and (2) not just speech intelligibility but what specific psycholinguistic processes are encoded by entrained cortical activity. Similarly, the anatomical and spectro-temporal details of entrained activity need to be taken into account when investigating its functional properties.
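A minimal sketch of the band-specific comparison mentioned above: filter a (synthetic) neural response and a speech envelope into delta (1-4 Hz) and theta (4-8 Hz) bands and correlate them. Real studies use coherence or TRF-style measures on recorded data; everything here is a placeholder.

```python
# Hedged sketch: band-limited envelope-tracking correlations in delta and theta.
import numpy as np
from scipy.signal import butter, sosfiltfilt

fs, dur = 100, 60                                      # assumed neural sample rate and duration
rng = np.random.default_rng(9)
n = fs * dur
lowpass = butter(2, 8, btype="lowpass", fs=fs, output="sos")
envelope = sosfiltfilt(lowpass, rng.standard_normal(n))      # placeholder speech envelope
neural = envelope + 1.0 * rng.standard_normal(n)             # noisy "entrained" response

for name, (lo, hi) in [("delta", (1, 4)), ("theta", (4, 8))]:
    sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
    r = np.corrcoef(sosfiltfilt(sos, neural), sosfiltfilt(sos, envelope))[0, 1]
    print(f"{name}-band envelope tracking r = {r:.2f}")
```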
Affiliation(s)
- Nai Ding
- Department of Psychology, New York University, New York, NY, USA
- Jonathan Z Simon
- Department of Electrical and Computer Engineering, University of Maryland College Park, College Park, MD, USA; Department of Biology, University of Maryland College Park, College Park, MD, USA; Institute for Systems Research, University of Maryland College Park, College Park, MD, USA