1. Hamersky GR, Shaheen LA, Espejo ML, Wingert JC, David SV. Reduced Neural Responses to Natural Foreground versus Background Sounds in the Auditory Cortex. J Neurosci 2025; 45:e0121242024. [PMID: 39837664] [PMCID: PMC11884389] [DOI: 10.1523/jneurosci.0121-24.2024]
Abstract
In everyday hearing, listeners face the challenge of understanding behaviorally relevant foreground stimuli (speech, vocalizations) in complex backgrounds (environmental, mechanical noise). Prior studies have shown that high-order areas of human auditory cortex (AC) preattentively form an enhanced representation of foreground stimuli in the presence of background noise. This enhancement requires identifying and grouping the features that comprise the background so they can be removed from the foreground representation. To study the cortical computations supporting this process, we recorded single-unit activity in AC of male and female ferrets during the presentation of concurrent natural sounds from these two categories. In contrast to expectations from studies in high-order AC, single-unit responses to foreground sounds were strongly reduced relative to the paired background in primary and secondary AC. The degree of reduction could not be explained by a neuron's preference for the foreground or background stimulus in isolation but could be partially explained by spectrotemporal statistics that distinguish foreground and background categories. Responses to synthesized sounds with statistics either matched or randomized relative to natural sounds showed progressively decreased reduction of foreground responses as natural sound statistics were removed. These results challenge the expectation that cortical foreground representations emerge directly from a mixed representation in the auditory periphery. Instead, they suggest the early AC maintains a robust representation of background noise. Strong background representations may produce a distributed code, facilitating selection of foreground signals from a relatively small subpopulation of AC neurons at later processing stages.
Affiliation(s)
- Gregory R Hamersky
  - Neuroscience Graduate Program, Oregon Health and Science University, Portland, Oregon 97239
  - Oregon Hearing Research Center, Oregon Health and Science University, Portland, Oregon 97239
- Luke A Shaheen
  - Oregon Hearing Research Center, Oregon Health and Science University, Portland, Oregon 97239
- Mateo López Espejo
  - Neuroscience Graduate Program, Oregon Health and Science University, Portland, Oregon 97239
  - Oregon Hearing Research Center, Oregon Health and Science University, Portland, Oregon 97239
- Jereme C Wingert
  - Oregon Hearing Research Center, Oregon Health and Science University, Portland, Oregon 97239
  - Behavioral and Systems Neuroscience Graduate Program, Oregon Health and Science University, Portland, Oregon 97239
- Stephen V David
  - Oregon Hearing Research Center, Oregon Health and Science University, Portland, Oregon 97239
2. Kong F, Zhou H, Zheng N, Meng Q. Sparse representation of speech using an atomic speech model. J Acoust Soc Am 2025; 157:1899-1911. [PMID: 40106275] [DOI: 10.1121/10.0036144]
Abstract
Speech perception has been extensively studied using degradation algorithms such as channel vocoding, mosaic speech, and pointillistic speech. Here, an "atomic speech model" is introduced to generate unique sparse time-frequency patterns. It processes speech signals using a bank of bandpass filters, undersamples the signals, and reproduces each sample using a Gaussian-enveloped tone (a Gabor atom). To examine atomic speech intelligibility, adaptive speech reception thresholds (SRTs) are measured as a function of atom rate in normal-hearing listeners, investigating the effects of spectral maxima, binaural integration, and single echo. Experiment 1 showed atomic speech with 4 spectral maxima out of 32 bands remained intelligible even at a low rate under 80 atoms per second. Experiment 2 showed that when atoms were nonoverlappingly assigned to both ears, the mean SRT increased (i.e., worsened) compared to the monaural condition, where all atoms were assigned to one ear. Individual data revealed that a few listeners could integrate information from both ears, performing comparably to the monaural condition. Experiment 3 indicated higher mean SRT with a 100 ms echo delay than that with shorter delays (e.g., 50, 25, and 0 ms). These findings demonstrate the utility of the atomic speech model for investigating speech perception and its underlying mechanisms.
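For readers wanting a concrete picture of the synthesis described above, the sketch below builds a sparse signal from Gaussian-enveloped tones (Gabor atoms). The sampling rate, atom times, frequencies, and envelope width are illustrative assumptions, not parameters taken from the paper.

```python
import numpy as np

def gabor_atom(fs, f0, t_center, duration, sigma, amplitude=1.0):
    """Gaussian-enveloped tone (Gabor atom) centered at t_center seconds."""
    t = np.arange(int(duration * fs)) / fs
    envelope = np.exp(-0.5 * ((t - t_center) / sigma) ** 2)
    return amplitude * envelope * np.sin(2 * np.pi * f0 * (t - t_center))

fs = 16000                     # assumed sampling rate (Hz)
signal = np.zeros(fs)          # 1 s buffer into which atoms are summed

# Place a few atoms at assumed times and center frequencies
# (a dense version at ~80 atoms per second would follow the same pattern).
for t_c, f_c in [(0.10, 500.0), (0.15, 1200.0), (0.30, 800.0)]:
    signal += gabor_atom(fs, f_c, t_c, duration=1.0, sigma=0.004)
```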
Affiliation(s)
- Fanhui Kong
  - School of Information Engineering, Guangzhou Panyu Polytechnic, Guangzhou, Guangdong 410630, China
  - Guangdong Key Laboratory of Intelligent Information Processing, College of Electronics and Information Engineering, Shenzhen University, Shenzhen, Guangdong 518052, China
- Huali Zhou
  - School of Electronics and Information Engineering, Heyuan Polytechnic, Heyuan, Guangdong 517000, China
- Nengheng Zheng
  - Guangdong Key Laboratory of Intelligent Information Processing, College of Electronics and Information Engineering, Shenzhen University, Shenzhen, Guangdong 518052, China
- Qinglin Meng
  - Acoustics Laboratory, School of Physics and Optoelectronics, South China University of Technology, Guangzhou, Guangdong 510641, China
3. Saddler MR, McDermott JH. Models optimized for real-world tasks reveal the task-dependent necessity of precise temporal coding in hearing. Nat Commun 2024; 15:10590. [PMID: 39632854] [PMCID: PMC11618365] [DOI: 10.1038/s41467-024-54700-5]
Abstract
Neurons encode information in the timing of their spikes in addition to their firing rates. Spike timing is particularly precise in the auditory nerve, where action potentials phase lock to sound with sub-millisecond precision, but its behavioral relevance remains uncertain. We optimized machine learning models to perform real-world hearing tasks with simulated cochlear input, assessing the precision of auditory nerve spike timing needed to reproduce human behavior. Models with high-fidelity phase locking exhibited more human-like sound localization and speech perception than models without, consistent with an essential role in human hearing. However, the temporal precision needed to reproduce human-like behavior varied across tasks, as did the precision that benefited real-world task performance. These effects suggest that perceptual domains incorporate phase locking to different extents depending on the demands of real-world hearing. The results illustrate how optimizing models for realistic tasks can clarify the role of candidate neural codes in perception.
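The paper couples detailed simulated auditory nerve input to task-optimized networks; the toy sketch below only illustrates the general idea of limiting phase locking by lowpass filtering a half-wave-rectified channel before spike generation. All values are assumptions, and this is far simpler than the model actually used in the study.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

fs = 50000
t = np.arange(int(0.05 * fs)) / fs
channel = np.sin(2 * np.pi * 3000 * t)     # stand-in for a 3 kHz cochlear-channel carrier
rectified = np.maximum(channel, 0.0)       # crude half-wave rectification

rng = np.random.default_rng(0)
for cutoff in (50.0, 3000.0, 10000.0):
    # A low cutoff keeps only the envelope; a high cutoff preserves fine-structure timing.
    sos = butter(4, cutoff, btype="low", fs=fs, output="sos")
    drive = np.clip(sosfiltfilt(sos, rectified), 0.0, None)
    rate = drive * 200.0                        # arbitrary firing-rate scaling (spikes/s)
    spikes = rng.random(rate.size) < rate / fs  # Bernoulli spike train for this cutoff
```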
Affiliation(s)
- Mark R Saddler
  - Department of Brain and Cognitive Sciences, MIT, Cambridge, MA, USA
  - McGovern Institute for Brain Research, MIT, Cambridge, MA, USA
  - Center for Brains, Minds, and Machines, MIT, Cambridge, MA, USA
- Josh H McDermott
  - Department of Brain and Cognitive Sciences, MIT, Cambridge, MA, USA
  - McGovern Institute for Brain Research, MIT, Cambridge, MA, USA
  - Center for Brains, Minds, and Machines, MIT, Cambridge, MA, USA
  - Program in Speech and Hearing Biosciences and Technology, Harvard, Cambridge, MA, USA
4. Cusimano M, Hewitt LB, McDermott JH. Listening with generative models. Cognition 2024; 253:105874. [PMID: 39216190] [DOI: 10.1016/j.cognition.2024.105874]
Abstract
Perception has long been envisioned to use an internal model of the world to explain the causes of sensory signals. However, such accounts have historically not been testable, typically requiring intractable search through the space of possible explanations. Using auditory scenes as a case study, we leveraged contemporary computational tools to infer explanations of sounds in a candidate internal generative model of the auditory world (ecologically inspired audio synthesizers). Model inferences accounted for many classic illusions. Unlike traditional accounts of auditory illusions, the model is applicable to any sound, and exhibited human-like perceptual organization for real-world sound mixtures. The combination of stimulus-computability and interpretable model structure enabled 'rich falsification', revealing additional assumptions about sound generation needed to account for perception. The results show how generative models can account for the perception of both classic illusions and everyday sensory signals, and illustrate the opportunities and challenges involved in incorporating them into theories of perception.
Affiliation(s)
- Maddie Cusimano
  - Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, United States of America
- Luke B Hewitt
  - Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, United States of America
- Josh H McDermott
  - Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, United States of America
  - McGovern Institute, Massachusetts Institute of Technology, United States of America
  - Center for Brains Minds and Machines, Massachusetts Institute of Technology, United States of America
  - Speech and Hearing Bioscience and Technology, Harvard University, United States of America
5. Hicks JM, McDermott JH. Noise schemas aid hearing in noise. Proc Natl Acad Sci U S A 2024; 121:e2408995121. [PMID: 39546566] [PMCID: PMC11588100] [DOI: 10.1073/pnas.2408995121]
Abstract
Human hearing is robust to noise, but the basis of this robustness is poorly understood. Several lines of evidence are consistent with the idea that the auditory system adapts to sound components that are stable over time, potentially achieving noise robustness by suppressing noise-like signals. Yet background noise often provides behaviorally relevant information about the environment and thus seems unlikely to be completely discarded by the auditory system. Motivated by this observation, we explored whether noise robustness might instead be mediated by internal models of noise structure that could facilitate the separation of background noise from other sounds. We found that detection, recognition, and localization in real-world background noise were better for foreground sounds positioned later in a noise excerpt, with performance improving over the initial second of exposure to a noise. These results are consistent with both adaptation-based and model-based accounts (adaptation increases over time and online noise estimation should benefit from acquiring more samples). However, performance was also robust to interruptions in the background noise and was enhanced for intermittently recurring backgrounds, neither of which would be expected from known forms of adaptation. Additionally, the performance benefit observed for foreground sounds occurring later within a noise excerpt was reduced for recurring noises, suggesting that a noise representation is built up during exposure to a new background noise and then maintained in memory. These findings suggest that noise robustness is supported by internal models-"noise schemas"-that are rapidly estimated, stored over time, and used to estimate other concurrent sounds.
Affiliation(s)
- Jarrod M. Hicks
  - Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA 02139
  - McGovern Institute, Massachusetts Institute of Technology, Cambridge, MA 02139
  - Center for Brains Minds and Machines, Massachusetts Institute of Technology, Cambridge, MA 02139
- Josh H. McDermott
  - Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA 02139
  - McGovern Institute, Massachusetts Institute of Technology, Cambridge, MA 02139
  - Center for Brains Minds and Machines, Massachusetts Institute of Technology, Cambridge, MA 02139
  - Program in Speech and Hearing Bioscience and Technology, Harvard University, Boston, MA 02115
6. Huo M, Sun Y, Fogerty D, Tang Y. Release from same-talker speech-in-speech masking: Effects of masker intelligibility and other contributing factors. J Acoust Soc Am 2024; 156:2960-2973. [PMID: 39485097] [DOI: 10.1121/10.0034235]
Abstract
Human speech perception declines in the presence of masking speech, particularly when the masker is intelligible and acoustically similar to the target. A prior investigation demonstrated a substantial reduction in masking when the intelligibility of competing speech was reduced by corrupting voiced segments with noise [Huo, Sun, Fogerty, and Tang (2023), "Quantifying informational masking due to masker intelligibility in same-talker speech-in-speech perception," in Interspeech 2023, pp. 1783-1787]. As this processing also reduced the prominence of voiced segments, it was unclear whether the unmasking was due to reduced linguistic content, acoustic similarity, or both. The current study compared the masking of original competing speech (high intelligibility) to competing speech with time reversal of voiced segments (VS-reversed, low intelligibility) at various target-to-masker ratios. Modeling results demonstrated similar energetic masking between the two maskers. However, intelligibility of the target speech was considerably better with the VS-reversed masker compared to the original masker, likely due to the reduced linguistic content. Further corrupting the masker's voiced segments resulted in additional release from masking. Acoustic analyses showed that the portion of target voiced segments overlapping with masker voiced segments and the similarity between target and masker overlapped voiced segments impacted listeners' speech recognition. Evidence also suggested modulation masking in the spectro-temporal domain interferes with listeners' ability to glimpse the target.
Affiliation(s)
- Mingyue Huo
  - Department of Linguistics, University of Illinois Urbana-Champaign, Urbana, Illinois 61801, USA
- Yinglun Sun
  - Department of Linguistics, University of Illinois Urbana-Champaign, Urbana, Illinois 61801, USA
- Daniel Fogerty
  - Department of Speech & Hearing Science, University of Illinois Urbana-Champaign, Urbana, Illinois 61801, USA
- Yan Tang
  - Department of Linguistics, University of Illinois Urbana-Champaign, Urbana, Illinois 61801, USA
  - Beckman Institute for Advanced Science and Technology, University of Illinois Urbana-Champaign, Urbana, Illinois 61801, USA
7. Lee J, Oxenham AJ. Testing the role of temporal coherence on speech intelligibility with noise and single-talker maskers. J Acoust Soc Am 2024; 156:3285-3297. [PMID: 39545746] [PMCID: PMC11575144] [DOI: 10.1121/10.0034420]
Abstract
Temporal coherence, where sounds with aligned timing patterns are perceived as a single source, is considered an essential cue in auditory scene analysis. However, its effects have been studied primarily with simple repeating tones, rather than speech. This study investigated the role of temporal coherence in speech by introducing across-frequency asynchronies. The effect of asynchrony on the intelligibility of target sentences was tested in the presence of background speech-shaped noise or a single-talker interferer. Our hypothesis was that disrupting temporal coherence should not only reduce intelligibility but also impair listeners' ability to segregate the target speech from an interfering talker, leading to greater degradation for speech-in-speech than speech-in-noise tasks. Stimuli were filtered into eight frequency bands, which were then desynchronized with delays of 0-120 ms. As expected, intelligibility declined as asynchrony increased. However, the decline was similar for both noise and single-talker maskers. Primarily target, rather than masker, asynchrony affected performance for both natural (forward) and reversed-speech maskers, and for target sentences with low and high semantic context. The results suggest that temporal coherence may not be as critical a cue for speech segregation as it is for the non-speech stimuli traditionally used in studies of auditory scene analysis.
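A rough sketch of the kind of across-frequency desynchronization described above (filter into bands, delay each band by a different amount). The filter design, band edges, and delay distribution are assumptions rather than the study's exact processing.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def desynchronize(x, fs, band_edges_hz, max_delay_s, seed=0):
    """Split x into bandpass channels and delay each band by a different amount."""
    rng = np.random.default_rng(seed)
    out = np.zeros_like(x)
    for lo, hi in band_edges_hz:
        sos = butter(4, [lo, hi], btype="band", fs=fs, output="sos")
        band = sosfilt(sos, x)
        delay = int(rng.uniform(0, max_delay_s) * fs)   # per-band delay in samples
        out += np.concatenate([np.zeros(delay), band[:band.size - delay]])
    return out

fs = 16000
x = np.random.randn(fs)                      # stand-in for a 1 s speech signal
edges = [(100, 300), (300, 700), (700, 1500), (1500, 3000), (3000, 6000)]
y = desynchronize(x, fs, edges, max_delay_s=0.12)   # asynchronies up to 120 ms
```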
Affiliation(s)
- Jaeeun Lee
  - Department of Psychology, University of Minnesota, Minneapolis, Minnesota 55455, USA
- Andrew J Oxenham
  - Department of Psychology, University of Minnesota, Minneapolis, Minnesota 55455, USA
8. Saddler MR, McDermott JH. Models optimized for real-world tasks reveal the task-dependent necessity of precise temporal coding in hearing. bioRxiv (preprint) 2024:2024.04.21.590435. [PMID: 38712054] [PMCID: PMC11071365] [DOI: 10.1101/2024.04.21.590435]
Abstract
Neurons encode information in the timing of their spikes in addition to their firing rates. Spike timing is particularly precise in the auditory nerve, where action potentials phase lock to sound with sub-millisecond precision, but its behavioral relevance remains uncertain. We optimized machine learning models to perform real-world hearing tasks with simulated cochlear input, assessing the precision of auditory nerve spike timing needed to reproduce human behavior. Models with high-fidelity phase locking exhibited more human-like sound localization and speech perception than models without, consistent with an essential role in human hearing. However, the temporal precision needed to reproduce human-like behavior varied across tasks, as did the precision that benefited real-world task performance. These effects suggest that perceptual domains incorporate phase locking to different extents depending on the demands of real-world hearing. The results illustrate how optimizing models for realistic tasks can clarify the role of candidate neural codes in perception.
Affiliation(s)
- Mark R Saddler
  - Department of Brain and Cognitive Sciences, MIT, Cambridge, MA, USA
  - McGovern Institute for Brain Research, MIT, Cambridge, MA, USA
  - Center for Brains, Minds, and Machines, MIT, Cambridge, MA, USA
- Josh H McDermott
  - Department of Brain and Cognitive Sciences, MIT, Cambridge, MA, USA
  - McGovern Institute for Brain Research, MIT, Cambridge, MA, USA
  - Center for Brains, Minds, and Machines, MIT, Cambridge, MA, USA
  - Program in Speech and Hearing Biosciences and Technology, Harvard, Cambridge, MA, USA
9. Swerdlow NR, Gonzalez CE, Raza MU, Gautam D, Miyakoshi M, Clayson PE, Joshi YB, Molina JL, Talledo J, Thomas ML, Light GA, Sivarao DV. Effects of Memantine on the Auditory Steady-State and Harmonic Responses to 40 Hz Stimulation Across Species. Biol Psychiatry Cogn Neurosci Neuroimaging 2024; 9:346-355. [PMID: 37683728] [PMCID: PMC12045617] [DOI: 10.1016/j.bpsc.2023.08.009]
Abstract
BACKGROUND: Click trains elicit an auditory steady-state response (ASSR) at the driving frequency (1F) and its integer multiple frequencies (2F, 3F, etc.) called harmonics; we call this harmonic response the steady-state harmonic response (SSHR). We describe the 40 Hz ASSR (1F) and 80 Hz SSHR (2F) in humans and rats and their sensitivity to the uncompetitive NMDA antagonist memantine. METHODS: In humans (healthy control participants, n = 25; patients with schizophrenia, n = 28), electroencephalography was recorded after placebo or 20 mg memantine in a within-participant crossover design. ASSR used 1 ms, 85-dB clicks presented in 250 trains (500 ms duration, 40 clicks/s). In freely moving rats (n = 9), electroencephalography was acquired after memantine (0, 0.3, 1, 3 mg/kg) in a within-participant crossover design; 65-dB click trains used 5-mV monophasic, 1-ms square waves (40/s). RESULTS: Across species, ASSR at 1F generated greater evoked power (EP) than the 2F SSHR. Greater intertrial coherence (ITC) at 1F than at 2F was also detected in humans, but the opposite relationship (ITC: 2F > 1F) was seen in rats. EP and ITC at 1F were deficient in patients and were enhanced by memantine across species. EP and ITC at 2F were deficient in patients. Measures at 2F were generally insensitive to memantine across species, although in humans the ITC harmonic ratio (1F:2F) was modestly enhanced by memantine, and in rats, both the EP and ITC harmonic ratios were significantly enhanced by memantine. CONCLUSIONS: ASSR and SSHR are robust, nonredundant electroencephalography signals that are suitable for cross-species analyses that reveal potentially meaningful differences across species, diagnoses, and drugs.
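As a schematic of the two EEG measures named above, the sketch below computes evoked power (EP) and intertrial coherence (ITC) at 1F (40 Hz) and 2F (80 Hz) from synthetic trials; the data and parameters are placeholders, not the study's recordings or analysis pipeline.

```python
import numpy as np

fs, n_trials, trial_len = 1000, 250, 0.5      # assumed: 1 kHz EEG, 250 trials of 500 ms
t = np.arange(int(trial_len * fs)) / fs
rng = np.random.default_rng(1)

# Synthetic trials: a 40 Hz response plus noise (placeholder for recorded EEG epochs)
trials = 0.5 * np.sin(2 * np.pi * 40 * t) + rng.standard_normal((n_trials, t.size))

spectra = np.fft.rfft(trials, axis=1)
freqs = np.fft.rfftfreq(t.size, d=1 / fs)

for f in (40, 80):                             # 1F (ASSR) and 2F (SSHR)
    idx = np.argmin(np.abs(freqs - f))
    bins = spectra[:, idx]
    evoked_power = np.abs(bins.mean()) ** 2                 # power of the phase-locked (averaged) response
    itc = np.abs(np.mean(bins / np.abs(bins)))              # inter-trial phase coherence, 0 to 1
    print(f"{f} Hz: EP = {evoked_power:.3f}, ITC = {itc:.3f}")
```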
Affiliation(s)
- Neal R Swerdlow
  - Department of Psychiatry, University of California San Diego School of Medicine, La Jolla, California
  - VISN 22 Mental Illness Research, Education, and Clinical Center, San Diego Veterans Administration Health System, La Jolla, California
- Christopher E Gonzalez
  - Department of Psychiatry, University of California San Diego School of Medicine, La Jolla, California
  - VISN 22 Mental Illness Research, Education, and Clinical Center, San Diego Veterans Administration Health System, La Jolla, California
- Muhammad Ummear Raza
  - Pharmaceutical Sciences, Bill Gatton College of Pharmacy, East Tennessee State University, Johnson City, Tennessee
- Deepshila Gautam
  - Pharmaceutical Sciences, Bill Gatton College of Pharmacy, East Tennessee State University, Johnson City, Tennessee
- Makoto Miyakoshi
  - Division of Child and Adolescent Psychiatry, Cincinnati Children's Hospital Medical Center, Cincinnati, Ohio
  - Department of Psychiatry, University of Cincinnati College of Medicine, Cincinnati, Ohio
- Peter E Clayson
  - Department of Psychology, University of South Florida, Tampa, Florida
- Yash B Joshi
  - Department of Psychiatry, University of California San Diego School of Medicine, La Jolla, California
  - VISN 22 Mental Illness Research, Education, and Clinical Center, San Diego Veterans Administration Health System, La Jolla, California
- Juan L Molina
  - Department of Psychiatry, University of California San Diego School of Medicine, La Jolla, California
  - VISN 22 Mental Illness Research, Education, and Clinical Center, San Diego Veterans Administration Health System, La Jolla, California
- Jo Talledo
  - Department of Psychiatry, University of California San Diego School of Medicine, La Jolla, California
- Michael L Thomas
  - Department of Psychology, Colorado State University, Fort Collins, Colorado
- Gregory A Light
  - Department of Psychiatry, University of California San Diego School of Medicine, La Jolla, California
  - VISN 22 Mental Illness Research, Education, and Clinical Center, San Diego Veterans Administration Health System, La Jolla, California
- Digavalli V Sivarao
  - Pharmaceutical Sciences, Bill Gatton College of Pharmacy, East Tennessee State University, Johnson City, Tennessee
10. Gautam D, Raza MU, Miyakoshi M, Molina JL, Joshi YB, Clayson PE, Light GA, Swerdlow NR, Sivarao DV. Click-train evoked steady state harmonic response as a novel pharmacodynamic biomarker of cortical oscillatory synchrony. Neuropharmacology 2023; 240:109707. [PMID: 37673332] [DOI: 10.1016/j.neuropharm.2023.109707]
Abstract
Sensory networks naturally entrain to rhythmic stimuli like a click train delivered at a particular frequency. Such synchronization is integral to information processing, can be measured by electroencephalography (EEG) and is an accessible index of neural network function. Click trains evoke neural entrainment not only at the driving frequency (F), referred to as the auditory steady state response (ASSR), but also at its higher multiples called the steady state harmonic response (SSHR). Since harmonics play an important and non-redundant role in acoustic information processing, we hypothesized that SSHR may differ from ASSR in presentation and pharmacological sensitivity. In female SD rats, a 2 s-long train stimulus was used to evoke ASSR at 20 Hz and its SSHR at 40, 60 and 80 Hz, recorded from a prefrontal epidural electrode. Narrow band evoked responses were evident at all frequencies; signal power was strongest at 20 Hz while phase synchrony was strongest at 80 Hz. SSHR at 40 Hz took the longest time (∼180 ms from stimulus onset) to establish synchrony. The NMDA antagonist MK801 (0.025-0.1 mg/kg) did not consistently affect 20 Hz ASSR phase synchrony but robustly and dose-dependently attenuated synchrony of all SSHR. Evoked power was attenuated by MK801 at 20 Hz ASSR and 40 Hz SSHR only. Thus, presentation as well as pharmacological sensitivity distinguished SSHR from ASSR, making them non-redundant markers of cortical network function. SSHR is a novel and promising translational biomarker of cortical oscillatory dynamics that may have important applications in CNS drug development and personalized medicine.
Affiliation(s)
- Deepshila Gautam
  - Department of Pharmaceutical Sciences, Bill Gatton College of Pharmacy, East Tennessee State University, Johnson City, TN 37604, USA
- Muhammad Ummear Raza
  - Department of Pharmaceutical Sciences, Bill Gatton College of Pharmacy, East Tennessee State University, Johnson City, TN 37604, USA
- M Miyakoshi
  - Division of Child and Adolescent Psychiatry, Cincinnati Children's Hospital Medical Center, Cincinnati, OH, USA
- J L Molina
  - Department of Psychiatry, UCSD School of Medicine, La Jolla, CA, USA
  - VISN 22 MIRECC, SD Veterans Administration Health System, La Jolla, CA, USA
- Y B Joshi
  - Department of Psychiatry, UCSD School of Medicine, La Jolla, CA, USA
  - VISN 22 MIRECC, SD Veterans Administration Health System, La Jolla, CA, USA
- P E Clayson
  - Department of Psychology, University of South Florida, Tampa, FL, USA
- G A Light
  - Department of Psychiatry, UCSD School of Medicine, La Jolla, CA, USA
  - VISN 22 MIRECC, SD Veterans Administration Health System, La Jolla, CA, USA
- N R Swerdlow
  - Department of Psychiatry, UCSD School of Medicine, La Jolla, CA, USA
  - VISN 22 MIRECC, SD Veterans Administration Health System, La Jolla, CA, USA
- Digavalli V Sivarao
  - Department of Pharmaceutical Sciences, Bill Gatton College of Pharmacy, East Tennessee State University, Johnson City, TN 37604, USA
11. Whiteford KL, Oxenham AJ. Sensitivity to Frequency Modulation is Limited Centrally. J Neurosci 2023; 43:3687-3695. [PMID: 37028932] [PMCID: PMC10198444] [DOI: 10.1523/jneurosci.0995-22.2023]
Abstract
Modulations in both amplitude and frequency are prevalent in natural sounds and are critical in defining their properties. Humans are exquisitely sensitive to frequency modulation (FM) at the slow modulation rates and low carrier frequencies that are common in speech and music. This enhanced sensitivity to slow-rate and low-frequency FM has been widely believed to reflect precise, stimulus-driven phase locking to temporal fine structure in the auditory nerve. At faster modulation rates and/or higher carrier frequencies, FM is instead thought to be coded by coarser frequency-to-place mapping, where FM is converted to amplitude modulation (AM) via cochlear filtering. Here, we show that patterns of human FM perception that have classically been explained by limits in peripheral temporal coding are instead better accounted for by constraints in the central processing of fundamental frequency (F0) or pitch. We measured FM detection in male and female humans using harmonic complex tones with an F0 within the range of musical pitch but with resolved harmonic components that were all above the putative limits of temporal phase locking (>8 kHz). Listeners were more sensitive to slow than fast FM rates, even though all components were beyond the limits of phase locking. In contrast, AM sensitivity remained better at faster than slower rates, regardless of carrier frequency. These findings demonstrate that classic trends in human FM sensitivity, previously attributed to auditory nerve phase locking, may instead reflect the constraints of a unitary code that operates at a more central level of processing.
SIGNIFICANCE STATEMENT: Natural sounds involve dynamic frequency and amplitude fluctuations. Humans are particularly sensitive to frequency modulation (FM) at slow rates and low carrier frequencies, which are prevalent in speech and music. This sensitivity has been ascribed to encoding of stimulus temporal fine structure (TFS) via phase-locked auditory nerve activity. To test this long-standing theory, we measured FM sensitivity using complex tones with a low F0 but only high-frequency harmonics beyond the limits of phase locking. Dissociating the F0 from TFS showed that FM sensitivity is limited not by peripheral encoding of TFS but rather by central processing of F0, or pitch. The results suggest a unitary code for FM detection limited by more central constraints.
Affiliation(s)
- Kelly L Whiteford
  - Department of Psychology, University of Minnesota, Minneapolis, Minnesota 55455
- Andrew J Oxenham
  - Department of Psychology, University of Minnesota, Minneapolis, Minnesota 55455
12. McPherson MJ, McDermott JH. Relative pitch representations and invariance to timbre. Cognition 2023; 232:105327. [PMID: 36495710] [PMCID: PMC10016107] [DOI: 10.1016/j.cognition.2022.105327]
Abstract
Information in speech and music is often conveyed through changes in fundamental frequency (f0), perceived by humans as "relative pitch". Relative pitch judgments are complicated by two facts. First, sounds can simultaneously vary in timbre due to filtering imposed by a vocal tract or instrument body. Second, relative pitch can be extracted in two ways: by measuring changes in constituent frequency components from one sound to another, or by estimating the f0 of each sound and comparing the estimates. We examined the effects of timbral differences on relative pitch judgments, and whether any invariance to timbre depends on whether judgments are based on constituent frequencies or their f0. Listeners performed up/down and interval discrimination tasks with pairs of spoken vowels, instrument notes, or synthetic tones, synthesized to be either harmonic or inharmonic. Inharmonic sounds lack a well-defined f0, such that relative pitch must be extracted from changes in individual frequencies. Pitch judgments were less accurate when vowels/instruments were different compared to when they were the same, and were biased by the associated timbre differences. However, this bias was similar for harmonic and inharmonic sounds, and was observed even in conditions where judgments of harmonic sounds were based on f0 representations. Relative pitch judgments are thus not invariant to timbre, even when timbral variation is naturalistic, and when such judgments are based on representations of f0.
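A minimal sketch of one common way to construct harmonic and inharmonic complex tones of the kind described above, by jittering component frequencies so that no well-defined f0 remains. The jitter range, number of harmonics, and other values are assumptions rather than the study's exact stimuli.

```python
import numpy as np

def complex_tone(f0, n_harmonics, fs, dur, jitter_frac=0.0, seed=0):
    """Harmonic complex; with jitter_frac > 0 each component is shifted to make it inharmonic."""
    rng = np.random.default_rng(seed)
    t = np.arange(int(dur * fs)) / fs
    tone = np.zeros_like(t)
    for k in range(1, n_harmonics + 1):
        jitter = rng.uniform(-jitter_frac, jitter_frac) * f0 if jitter_frac else 0.0
        tone += np.sin(2 * np.pi * (k * f0 + jitter) * t + rng.uniform(0, 2 * np.pi))
    return tone / n_harmonics

fs = 44100
harmonic = complex_tone(200.0, 12, fs, 0.4)                       # well-defined f0
inharmonic = complex_tone(200.0, 12, fs, 0.4, jitter_frac=0.3)    # no clear f0
```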
Affiliation(s)
- Malinda J McPherson
  - Department of Brain and Cognitive Sciences, MIT, Cambridge, MA 02139, United States of America
  - Program in Speech and Hearing Biosciences and Technology, Harvard University, Boston, MA 02115, United States of America
  - McGovern Institute for Brain Research, MIT, Cambridge, MA 02139, United States of America
- Josh H McDermott
  - Department of Brain and Cognitive Sciences, MIT, Cambridge, MA 02139, United States of America
  - Program in Speech and Hearing Biosciences and Technology, Harvard University, Boston, MA 02115, United States of America
  - McGovern Institute for Brain Research, MIT, Cambridge, MA 02139, United States of America
  - Center for Brains Minds and Machines, MIT, Cambridge, MA 02139, United States of America
13. Steinmetzger K, Rosen S. No evidence for a benefit from masker harmonicity in the perception of speech in noise. J Acoust Soc Am 2023; 153:1064. [PMID: 36859153] [DOI: 10.1121/10.0017065]
Abstract
When assessing the intelligibility of speech embedded in background noise, maskers with a harmonic spectral structure have been found to be much less detrimental to performance than noise-based interferers. While spectral "glimpsing" in between the resolved masker harmonics and reduced envelope modulations of harmonic maskers have been shown to contribute, this effect has primarily been attributed to the proposed ability of the auditory system to cancel harmonic maskers from the signal mixture. Here, speech intelligibility in the presence of harmonic and inharmonic maskers with similar spectral glimpsing opportunities and envelope modulation spectra was assessed to test the theory of harmonic cancellation. Speech reception thresholds obtained from normal-hearing listeners revealed no effect of masker harmonicity, neither for maskers with static nor dynamic pitch contours. The results show that harmonicity, or time-domain periodicity, as such, does not aid the segregation of speech and masker. Contrary to what might be assumed, this also implies that the saliency of the masker pitch did not affect auditory grouping. Instead, the current data suggest that the reduced masking effectiveness of harmonic sounds is due to the regular spacing of their spectral components.
Affiliation(s)
- Kurt Steinmetzger
  - Section of Biomagnetism, Department of Neurology, Heidelberg University Hospital, Im Neuenheimer Feld 400, 69120 Heidelberg, Germany
- Stuart Rosen
  - Speech, Hearing and Phonetic Sciences, University College London (UCL), Chandler House, 2 Wakefield Street, London, WC1N 1PF, United Kingdom
14. Basiński K, Quiroga-Martinez DR, Vuust P. Temporal hierarchies in the predictive processing of melody - From pure tones to songs. Neurosci Biobehav Rev 2023; 145:105007. [PMID: 36535375] [DOI: 10.1016/j.neubiorev.2022.105007]
Abstract
Listening to musical melodies is a complex task that engages perceptual and memory-related processes. The processes underlying melody cognition happen simultaneously on different timescales, ranging from milliseconds to minutes. Although attempts have been made, research on melody perception is yet to produce a unified framework of how melody processing is achieved in the brain. This may in part be due to the difficulty of integrating concepts such as perception, attention and memory, which pertain to different temporal scales. Recent theories on brain processing, which hold prediction as a fundamental principle, offer potential solutions to this problem and may provide a unifying framework for explaining the neural processes that enable melody perception on multiple temporal levels. In this article, we review empirical evidence for predictive coding on the levels of pitch formation, basic pitch-related auditory patterns, more complex regularity processing extracted from basic patterns, and long-term expectations related to musical syntax. We also identify areas that would benefit from further inquiry and suggest future directions in research on musical melody perception.
Affiliation(s)
- Krzysztof Basiński
  - Division of Quality of Life Research, Medical University of Gdańsk, Poland
- David Ricardo Quiroga-Martinez
  - Helen Wills Neuroscience Institute & Department of Psychology, University of California Berkeley, USA
  - Center for Music in the Brain, Aarhus University & The Royal Academy of Music, Denmark
- Peter Vuust
  - Center for Music in the Brain, Aarhus University & The Royal Academy of Music, Denmark
15. Siedenburg K, Graves J, Pressnitzer D. A unitary model of auditory frequency change perception. PLoS Comput Biol 2023; 19:e1010307. [PMID: 36634121] [DOI: 10.1371/journal.pcbi.1010307]
Abstract
Changes in the frequency content of sounds over time are arguably the most basic form of information about the behavior of sound-emitting objects. In perceptual studies, such changes have mostly been investigated separately, as aspects of either pitch or timbre. Here, we propose a unitary account of "up" and "down" subjective judgments of frequency change, based on a model combining auditory correlates of acoustic cues in a sound-specific and listener-specific manner. To do so, we introduce a generalized version of so-called Shepard tones, allowing symmetric manipulations of spectral information on a fine scale, usually associated with pitch (spectral fine structure, SFS), and on a coarse scale, usually associated with timbre (spectral envelope, SE). In a series of behavioral experiments, listeners reported "up" or "down" shifts across pairs of generalized Shepard tones that differed in SFS, in SE, or in both. We observed the classic properties of Shepard tones for either SFS or SE shifts: subjective judgments followed the smallest log-frequency change direction, with cases of ambiguity and circularity. Interestingly, when both SFS and SE changes were applied concurrently (synergistically or antagonistically), we observed a trade-off between cues. Listeners were encouraged to report when they perceived "both" directions of change concurrently, but this rarely happened, suggesting a unitary percept. A computational model could accurately fit the behavioral data by combining different cues reflecting frequency changes after auditory filtering. The model revealed that cue weighting depended on the nature of the sound. When presented with harmonic sounds, listeners put more weight on SFS-related cues, whereas inharmonic sounds led to more weight on SE-related cues. Moreover, these stimulus-based factors were modulated by inter-individual differences, revealing variability across listeners in the detailed recipe for "up" and "down" judgments. We argue that frequency changes are tracked perceptually via the adaptive combination of a diverse set of cues, in a manner that is in fact similar to the derivation of other basic auditory dimensions such as spatial location.
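A rough sketch of a Shepard-like tone built from octave-spaced partials (spectral fine structure) under a fixed log-frequency Gaussian envelope (spectral envelope); shifting the partials versus the envelope corresponds loosely to the SFS and SE manipulations described above. All constants below are assumed, not taken from the paper.

```python
import numpy as np

def shepard_tone(fs, dur, base_hz, env_center_hz, env_width_oct=1.5, n_oct=8):
    """Octave-spaced partials (SFS) weighted by a log-frequency Gaussian envelope (SE)."""
    t = np.arange(int(dur * fs)) / fs
    tone = np.zeros_like(t)
    for k in range(n_oct):
        f = base_hz * 2 ** k
        if f >= fs / 2:
            break
        w = np.exp(-0.5 * (np.log2(f / env_center_hz) / env_width_oct) ** 2)
        tone += w * np.sin(2 * np.pi * f * t)
    return tone / np.max(np.abs(tone))

fs = 44100
a = shepard_tone(fs, 0.5, base_hz=32.0, env_center_hz=960.0)                    # reference
b = shepard_tone(fs, 0.5, base_hz=32.0 * 2 ** (1 / 12), env_center_hz=960.0)    # SFS up one semitone
c = shepard_tone(fs, 0.5, base_hz=32.0, env_center_hz=960.0 * 2 ** (1 / 12))    # SE up one semitone
```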
Affiliation(s)
- Kai Siedenburg
  - Carl von Ossietzky University of Oldenburg, Dept. of Medical Physics and Acoustics, Oldenburg, Germany
- Jackson Graves
  - Laboratoire des systèmes perceptifs, Dépt. d’études cognitives, École normale supérieure, PSL University, CNRS, Paris, France
- Daniel Pressnitzer
  - Laboratoire des systèmes perceptifs, Dépt. d’études cognitives, École normale supérieure, PSL University, CNRS, Paris, France
16. Lanzilotti C, Andéol G, Micheyl C, Scannella S. Cocktail party training induces increased speech intelligibility and decreased cortical activity in bilateral inferior frontal gyri. A functional near-infrared study. PLoS One 2022; 17:e0277801. [PMID: 36454948] [PMCID: PMC9714910] [DOI: 10.1371/journal.pone.0277801]
Abstract
The human brain networks responsible for selectively listening to a voice amid other talkers remain to be clarified. The present study aimed to investigate relationships between cortical activity and performance in a speech-in-speech task, before (Experiment I) and after training-induced improvements (Experiment II). In Experiment I, 74 participants performed a speech-in-speech task while their cortical activity was measured using a functional near infrared spectroscopy (fNIRS) device. One target talker and one masker talker were simultaneously presented at three different target-to-masker ratios (TMRs): adverse, intermediate and favorable. Behavioral results show that performance increased monotonically with TMR in some participants, whereas for others it did not decrease, or even improved, in the adverse-TMR condition. On the neural level, an extensive brain network including the frontal (left prefrontal cortex, right dorsolateral prefrontal cortex and bilateral inferior frontal gyri) and temporal (bilateral auditory cortex) regions was recruited more strongly by the intermediate condition than by the other two. Additionally, bilateral frontal gyri and left auditory cortex activities were found to be positively correlated with behavioral performance in the adverse-TMR condition. In Experiment II, 27 participants, whose performance was the poorest in the adverse-TMR condition of Experiment I, were trained to improve performance in that condition. Results show significant performance improvements along with decreased activity in bilateral inferior frontal gyri, the right dorsolateral prefrontal cortex, the left inferior parietal cortex and the right auditory cortex in the adverse-TMR condition after training. Arguably, lower neural activity reflects higher efficiency in processing masker inhibition after speech-in-speech training. As speech-in-noise tasks also engage frontal and temporal regions, we suggest that, regardless of the type of masking (speech or noise), the complexity of the task will recruit a similar brain network. Furthermore, the initially substantial cognitive recruitment is reduced after training, reflecting an economy of cognitive resources.
Affiliation(s)
- Cosima Lanzilotti
  - Département Neuroscience et Sciences Cognitives, Institut de Recherche Biomédicale des Armées, Brétigny sur Orge, France
  - ISAE-SUPAERO, Université de Toulouse, Toulouse, France
  - Thales SIX GTS France, Gennevilliers, France
- Guillaume Andéol
  - Département Neuroscience et Sciences Cognitives, Institut de Recherche Biomédicale des Armées, Brétigny sur Orge, France
17. Prud'homme L, Lavandier M, Best V. Investigating the role of harmonic cancellation in speech-on-speech masking. Hear Res 2022; 426:108562. [PMID: 35768309] [PMCID: PMC9722527] [DOI: 10.1016/j.heares.2022.108562]
Abstract
This study investigated the role of harmonic cancellation in the intelligibility of speech in "cocktail party" situations. While there is evidence that harmonic cancellation plays a role in the segregation of simple harmonic sounds based on fundamental frequency (F0), its utility for mixtures of speech containing non-stationary F0s and unvoiced segments is unclear. Here we focused on the energetic masking of speech targets caused by competing speech maskers. Speech reception thresholds were measured using seven maskers: speech-shaped noise, monotonized and intonated harmonic complexes, monotonized speech, noise-vocoded speech, reversed speech and natural speech. These maskers enabled an estimate of how the masking potential of speech is influenced by harmonic structure, amplitude modulation and variations in F0 over time. Measured speech reception thresholds were compared to the predictions of two computational models, with and without a harmonic cancellation component. Overall, the results suggest a minor role of harmonic cancellation in reducing energetic masking in speech mixtures.
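Harmonic cancellation is often modeled as a comb filter that subtracts a copy of the mixture delayed by one masker period, nulling components at multiples of the masker F0; the sketch below illustrates that idea with assumed signals and is not one of the models evaluated in the study.

```python
import numpy as np

def cancel_harmonics(mixture, fs, masker_f0):
    """Comb-filter cancellation: y[n] = x[n] - x[n - T], with T = one masker period."""
    period = int(round(fs / masker_f0))
    delayed = np.concatenate([np.zeros(period), mixture[:-period]])
    return mixture - delayed

fs = 16000
t = np.arange(fs) / fs
masker = sum(np.sin(2 * np.pi * 100 * k * t) for k in range(1, 11))   # harmonic masker, F0 = 100 Hz
target = np.sin(2 * np.pi * 440 * t)                                  # tone standing in for target speech
cleaned = cancel_harmonics(masker + target, fs, masker_f0=100.0)
# Components at multiples of 100 Hz are strongly attenuated; the 440 Hz target
# survives, scaled by the comb filter's response at that frequency.
```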
Affiliation(s)
- Luna Prud'homme
  - Univ Lyon, ENTPE, Ecole Centrale de Lyon, CNRS, LTDS, UMR5513, 69518 Vaulx-en-Velin, France
- Mathieu Lavandier
  - Univ Lyon, ENTPE, Ecole Centrale de Lyon, CNRS, LTDS, UMR5513, 69518 Vaulx-en-Velin, France
- Virginia Best
  - Department of Speech, Language and Hearing Sciences, Boston University, 635 Commonwealth Ave, Boston, MA, 02215, USA
18. Quiroga-Martinez DR, Basiński K, Nasielski J, Tillmann B, Brattico E, Cholvy F, Fornoni L, Vuust P, Caclin A. Enhanced mismatch negativity in harmonic compared with inharmonic sounds. Eur J Neurosci 2022; 56:4583-4599. [PMID: 35833941] [PMCID: PMC9543822] [DOI: 10.1111/ejn.15769]
Abstract
Many natural sounds have frequency spectra composed of integer multiples of a fundamental frequency. This property, known as harmonicity, plays an important role in auditory information processing. However, the extent to which harmonicity influences the processing of sound features beyond pitch is still unclear. This is interesting because harmonic sounds have lower information entropy than inharmonic sounds. According to predictive processing accounts of perception, this property could produce more salient neural responses due to the brain's weighting of sensory signals according to their uncertainty. In the present study, we used electroencephalography to investigate brain responses to harmonic and inharmonic sounds commonly occurring in music: piano tones and hi-hat cymbal sounds. In a multifeature oddball paradigm, we measured mismatch negativity (MMN) and P3a responses to timbre, intensity, and location deviants in listeners with and without congenital amusia, an impairment of pitch processing. As hypothesized, we observed larger amplitudes and earlier latencies (for both MMN and P3a) in harmonic compared with inharmonic sounds. These harmonicity effects were modulated by sound feature. Moreover, the difference in P3a latency between harmonic and inharmonic sounds was larger for controls than amusics. We propose an explanation of these results based on predictive coding and discuss the relationship between harmonicity, information entropy, and precision weighting of prediction errors.
Affiliation(s)
- David Ricardo Quiroga-Martinez
  - Helen Wills Neuroscience Institute, University of California Berkeley, Berkeley, CA, USA
  - Center for Music in the Brain, Aarhus University & The Royal Academy of Music, Aarhus, Denmark
- Krzysztof Basiński
  - Division of Quality of Life Research, Faculty of Health Sciences, Medical University of Gdańsk, Gdańsk, Poland
- Barbara Tillmann
  - Lyon Neuroscience Research Center, CNRS UMR5292, INSERM U1028, Lyon, France
  - University Lyon 1, Lyon, France
- Elvira Brattico
  - Center for Music in the Brain, Aarhus University & The Royal Academy of Music, Aarhus, Denmark
  - Department of Educational Sciences, Psychology and Communication, University of Bari Aldo Moro, Bari, Italy
- Fanny Cholvy
  - Lyon Neuroscience Research Center, CNRS UMR5292, INSERM U1028, Lyon, France
  - University Lyon 1, Lyon, France
- Lesly Fornoni
  - Lyon Neuroscience Research Center, CNRS UMR5292, INSERM U1028, Lyon, France
  - University Lyon 1, Lyon, France
- Peter Vuust
  - Center for Music in the Brain, Aarhus University & The Royal Academy of Music, Aarhus, Denmark
- Anne Caclin
  - Lyon Neuroscience Research Center, CNRS UMR5292, INSERM U1028, Lyon, France
  - University Lyon 1, Lyon, France
19. Monson BB, Buss E. On the use of the TIMIT, QuickSIN, NU-6, and other widely used bandlimited speech materials for speech perception experiments. J Acoust Soc Am 2022; 152:1639. [PMID: 36182310] [PMCID: PMC9473723] [DOI: 10.1121/10.0013993]
Abstract
The use of spectrally degraded speech signals deprives listeners of acoustic information that is useful for speech perception. Several popular speech corpora, recorded decades ago, have spectral degradations, including limited extended high-frequency (EHF) (>8 kHz) content. Although frequency content above 8 kHz is often assumed to play little or no role in speech perception, recent research suggests that EHF content in speech can have a significant beneficial impact on speech perception under a wide range of natural listening conditions. This paper provides an analysis of the spectral content of popular speech corpora used for speech perception research to highlight the potential shortcomings of using bandlimited speech materials. Two corpora analyzed here, the TIMIT and NU-6, have substantial low-frequency spectral degradation (<500 Hz) in addition to EHF degradation. We provide an overview of the phenomena potentially missed by using bandlimited speech signals, and the factors to consider when selecting stimuli that are sensitive to these effects.
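One simple way to check a recording for the spectral degradations discussed above is to inspect its long-term average spectrum and the proportion of energy above 8 kHz and below 500 Hz. In the sketch below, the file path, window length, and band edges are placeholders.

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import welch

fs, x = wavfile.read("corpus_sentence.wav")   # placeholder path to a corpus recording
x = x.astype(float)
if x.ndim > 1:
    x = x.mean(axis=1)                        # mix to mono if stereo

freqs, psd = welch(x, fs=fs, nperseg=4096)    # long-term average spectrum estimate
ehf_fraction = psd[freqs > 8000].sum() / psd.sum()   # extended high-frequency energy
lf_fraction = psd[freqs < 500].sum() / psd.sum()     # low-frequency energy
print(f"Energy above 8 kHz: {100 * ehf_fraction:.2f}%, below 500 Hz: {100 * lf_fraction:.2f}%")
```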
Affiliation(s)
- Brian B Monson
  - Department of Speech and Hearing Science, University of Illinois Urbana-Champaign, Champaign, Illinois 61820, USA
- Emily Buss
  - Department of Otolaryngology/HNS, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27514, USA
20. Buss E, Miller MK, Leibold LJ. Maturation of Speech-in-Speech Recognition for Whispered and Voiced Speech. J Speech Lang Hear Res 2022; 65:3117-3128. [PMID: 35868232] [PMCID: PMC9911131] [DOI: 10.1044/2022_jslhr-21-00620]
Abstract
PURPOSE Some speech recognition data suggest that children rely less on voice pitch and harmonicity to support auditory scene analysis than adults. Two experiments evaluated development of speech-in-speech recognition using voiced speech and whispered speech, which lacks the harmonic structure of voiced speech. METHOD Listeners were 5- to 7-year-olds and adults with normal hearing. Targets were monosyllabic words organized into three-word sets that differ in vowel content. Maskers were two-talker or one-talker streams of speech. Targets and maskers were recorded by different female talkers in both voiced and whispered speaking styles. For each masker, speech reception thresholds (SRTs) were measured in all four combinations of target and masker speech, including matched and mismatched speaking styles for the target and masker. RESULTS Children performed more poorly than adults overall. For the two-talker masker, this age effect was smaller for the whispered target and masker than for the other three conditions. Children's SRTs in this condition were predominantly positive, suggesting that they may have relied on a wholistic listening strategy rather than segregating the target from the masker. For the one-talker masker, age effects were consistent across the four conditions. Reduced informational masking for the one-talker masker could be responsible for differences in age effects for the two maskers. A benefit of mismatching the target and masker speaking style was observed for both target styles in the two-talker masker and for the voiced targets in the one-talker masker. CONCLUSIONS These results provide no compelling evidence that young school-age children and adults are differentially sensitive to the cues present in voiced and whispered speech. Both groups benefit from mismatches in speaking style under some conditions. These benefits could be due to a combination of reduced perceptual similarity, harmonic cancelation, and differences in energetic masking.
Affiliation(s)
- Emily Buss
  - Department of Otolaryngology-Head and Neck Surgery, University of North Carolina at Chapel Hill
- Margaret K. Miller
  - Center for Hearing Research, Boys Town National Research Hospital, Omaha, NE
- Lori J. Leibold
  - Center for Hearing Research, Boys Town National Research Hospital, Omaha, NE
21. Brodbeck C, Simon JZ. Cortical tracking of voice pitch in the presence of multiple speakers depends on selective attention. Front Neurosci 2022; 16:828546. [PMID: 36003957] [PMCID: PMC9393379] [DOI: 10.3389/fnins.2022.828546]
Abstract
Voice pitch carries linguistic and non-linguistic information. Previous studies have described cortical tracking of voice pitch in clean speech, with responses reflecting both pitch strength and pitch value. However, pitch is also a powerful cue for auditory stream segregation, especially when competing streams have pitch differing in fundamental frequency, as is the case when multiple speakers talk simultaneously. We therefore investigated how cortical speech pitch tracking is affected in the presence of a second, task-irrelevant speaker. We analyzed human magnetoencephalography (MEG) responses to continuous narrative speech, presented either as a single talker in a quiet background or as a two-talker mixture of a male and a female speaker. In clean speech, voice pitch was associated with a right-dominant response, peaking at a latency of around 100 ms, consistent with previous electroencephalography and electrocorticography results. The response tracked both the presence of pitch and the relative value of the speaker's fundamental frequency. In the two-talker mixture, the pitch of the attended speaker was tracked bilaterally, regardless of whether or not there was simultaneously present pitch in the speech of the irrelevant speaker. Pitch tracking for the irrelevant speaker was reduced: only the right hemisphere still significantly tracked pitch of the unattended speaker, and only during intervals in which no pitch was present in the attended talker's speech. Taken together, these results suggest that pitch-based segregation of multiple speakers, at least as measured by macroscopic cortical tracking, is not entirely automatic but strongly dependent on selective attention.
Collapse
Affiliation(s)
- Christian Brodbeck
- Department of Psychological Sciences, University of Connecticut, Storrs, CT, United States
- Institute for Systems Research, University of Maryland, College Park, College Park, MD, United States
| | - Jonathan Z. Simon
- Institute for Systems Research, University of Maryland, College Park, College Park, MD, United States
- Department of Electrical and Computer Engineering, University of Maryland, College Park, College Park, MD, United States
- Department of Biology, University of Maryland, College Park, College Park, MD, United States
| |
Collapse
|
22
|
Zong N, Wu M. A Computational Model for Evaluating Transient Auditory Storage of Acoustic Features in Normal Listeners. SENSORS (BASEL, SWITZERLAND) 2022; 22:5033. [PMID: 35808528 PMCID: PMC9269764 DOI: 10.3390/s22135033] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 05/24/2022] [Revised: 06/28/2022] [Accepted: 06/29/2022] [Indexed: 02/05/2023]
Abstract
Humans are able to detect an instantaneous change in interaural correlation, demonstrating an ability to temporally process extremely rapid changes in interaural configuration. This temporal dynamic is related to human listeners' ability to store acoustic features in transient auditory memory. The present study investigated whether this transient auditory storage of acoustic features is affected by interaural delay, assessed by measuring sensitivity for detecting an instantaneous change in correlation in both wideband and narrowband correlated noise at various interaural delays. The study also examined whether an instantaneous change in correlation in correlated narrowband or wideband noise remained detectable at the longest interaural delay. A computational model of auditory processing was then applied to explore the relationship between wideband and narrowband noise with various center frequencies in the lower-level transient memory of acoustic features. The modeling results indicate that low-frequency information dominated perception and remained distinguishable over longer delays than the high-frequency components, and that the longest interaural delay for narrowband noise signals was highly correlated with that for wideband noise signals in the dynamic process of auditory perception.
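As a rough illustration of the paradigm (not the authors' model), the sketch below builds a two-channel noise stimulus whose interaural correlation drops instantaneously and tracks that drop with a short-time correlation measure; the sample rate, interaural delay, and window length are illustrative assumptions.

```python
import numpy as np

fs = 48000                       # sample rate (Hz), illustrative
dur, change_t = 1.0, 0.5         # 1 s stimulus; correlation change at 0.5 s
delay_samples = 24               # interaural delay of 0.5 ms, illustrative
n = int(fs * dur)

rng = np.random.default_rng(0)
left = rng.standard_normal(n)
right = np.roll(left, delay_samples)      # delayed copy -> interaurally correlated
decorrelated = rng.standard_normal(n)
k = int(fs * change_t)
right[k:] = decorrelated[k:]              # instantaneous drop in interaural correlation

def short_time_iac(l, r, delay, win=960):  # 20 ms windows
    """Interaural correlation at a fixed interaural delay, computed per window."""
    out = []
    r_shifted = np.roll(r, -delay)         # undo the imposed delay before correlating
    for start in range(0, len(l) - win, win):
        a, b = l[start:start + win], r_shifted[start:start + win]
        out.append(np.corrcoef(a, b)[0, 1])
    return np.array(out)

iac = short_time_iac(left, right, delay_samples)
print(iac.round(2))   # near 1 before the change, near 0 after it
```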
Collapse
Affiliation(s)
| | - Meihong Wu
- School of Informatics, Xiamen University, Xiamen 361005, China;
| |
Collapse
|
23
|
Dellaferrera G, Asabuki T, Fukai T. Modeling the Repetition-Based Recovering of Acoustic and Visual Sources With Dendritic Neurons. Front Neurosci 2022; 16:855753. [PMID: 35573290 PMCID: PMC9097820 DOI: 10.3389/fnins.2022.855753] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/15/2022] [Accepted: 03/31/2022] [Indexed: 11/13/2022] Open
Abstract
In natural auditory environments, acoustic signals originate from the temporal superimposition of different sound sources. The problem of inferring individual sources from ambiguous mixtures of sounds is known as blind source separation. Experiments on humans have demonstrated that the auditory system can identify sound sources as repeating patterns embedded in the acoustic input. Source repetition produces temporal regularities that can be detected and used for segregation. Specifically, listeners can identify sounds occurring more than once across different mixtures, but not sounds heard only in a single mixture. However, whether such behavior can be computationally modeled has not yet been explored. Here, we propose a biologically inspired computational model to perform blind source separation on sequences of mixtures of acoustic stimuli. Our method relies on a somatodendritic neuron model trained with a Hebbian-like learning rule that was originally conceived to detect spatio-temporal patterns recurring in synaptic inputs. We show that the segregation capabilities of our model are reminiscent of the features of human performance in a variety of experimental settings involving synthesized sounds with naturalistic properties. Furthermore, we extend the study to investigate the properties of segregation on task settings not yet explored with human subjects, namely natural sounds and images. Overall, our work suggests that somatodendritic neuron models offer a promising neuro-inspired learning strategy to account for the characteristics of the brain's segregation capabilities as well as to make predictions about yet untested experimental settings.
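The repetition principle can be illustrated with a toy single-unit sketch (not the somatodendritic model itself): a linear unit trained with a normalized Hebbian (Oja) rule on a sequence of mixtures gradually aligns its weights with the one source that repeats across mixtures. The dimensionality, learning rate, and signal statistics below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
dim, n_mixtures = 50, 2000

source = rng.standard_normal(dim)            # the one source that repeats across mixtures
source /= np.linalg.norm(source)

w = rng.standard_normal(dim) * 0.01          # unit weights, small random start
eta = 0.01                                   # learning rate (illustrative)

for _ in range(n_mixtures):
    distractor = rng.standard_normal(dim)    # heard only once, never repeats
    distractor /= np.linalg.norm(distractor)
    x = source + distractor                  # each mixture = repeated source + novel sound
    y = w @ x                                # unit response
    w += eta * y * (x - y * w)               # Oja's rule: Hebbian growth plus normalization

alignment = abs(w @ source) / np.linalg.norm(w)
print(f"weight/source alignment: {alignment:.2f}")   # approaches 1: the repeating source is learned
```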
Collapse
Affiliation(s)
- Giorgia Dellaferrera
- Neural Coding and Brain Computing Unit, Okinawa Institute of Science and Technology, Okinawa, Japan
- Institute of Neuroinformatics, University of Zurich and Swiss Federal Institute of Technology Zurich (ETH), Zurich, Switzerland
| | - Toshitake Asabuki
- Neural Coding and Brain Computing Unit, Okinawa Institute of Science and Technology, Okinawa, Japan
| | - Tomoki Fukai
- Neural Coding and Brain Computing Unit, Okinawa Institute of Science and Technology, Okinawa, Japan
| |
Collapse
|
24
|
Abstract
Hearing in noise is a core problem in audition, and a challenge for hearing-impaired listeners, yet the underlying mechanisms are poorly understood. We explored whether harmonic frequency relations, a signature property of many communication sounds, aid hearing in noise for normal hearing listeners. We measured detection thresholds in noise for tones and speech synthesized to have harmonic or inharmonic spectra. Harmonic signals were consistently easier to detect than otherwise identical inharmonic signals. Harmonicity also improved discrimination of sounds in noise. The largest benefits were observed for two-note up-down "pitch" discrimination and melodic contour discrimination, both of which could be performed equally well with harmonic and inharmonic tones in quiet, but which showed large harmonic advantages in noise. The results show that harmonicity facilitates hearing in noise, plausibly by providing a noise-robust pitch cue that aids detection and discrimination.
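A minimal sketch of the kind of stimuli being compared, assuming illustrative parameters (the F0, jitter range, and signal-to-noise ratio are not the study's values): a harmonic complex tone and a jittered, inharmonic counterpart, each embedded in the same noise.

```python
import numpy as np

fs, dur, f0, n_harm = 32000, 0.5, 200.0, 10
t = np.arange(int(fs * dur)) / fs
rng = np.random.default_rng(2)

def complex_tone(freqs):
    """Sum of equal-amplitude sinusoids with random phases."""
    phases = rng.uniform(0, 2 * np.pi, len(freqs))
    return sum(np.sin(2 * np.pi * f * t + p) for f, p in zip(freqs, phases))

harmonic_freqs = f0 * np.arange(1, n_harm + 1)
# Inharmonic version: jitter each component by up to +/-30% of F0 (illustrative),
# destroying integer-ratio relationships while keeping spectral density similar.
jitter = rng.uniform(-0.3, 0.3, n_harm) * f0
inharmonic_freqs = harmonic_freqs + jitter

noise = rng.standard_normal(len(t))
snr_db = -5.0
def mix(signal):
    sig = signal / np.sqrt(np.mean(signal ** 2))
    nse = noise / np.sqrt(np.mean(noise ** 2))
    return sig * 10 ** (snr_db / 20) + nse

harmonic_in_noise = mix(complex_tone(harmonic_freqs))
inharmonic_in_noise = mix(complex_tone(inharmonic_freqs))
```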
Collapse
|
25
|
Guest DR, Oxenham AJ. Human discrimination and modeling of high-frequency complex tones shed light on the neural codes for pitch. PLoS Comput Biol 2022; 18:e1009889. [PMID: 35239639 PMCID: PMC8923464 DOI: 10.1371/journal.pcbi.1009889] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/21/2021] [Revised: 03/15/2022] [Accepted: 02/02/2022] [Indexed: 11/24/2022] Open
Abstract
Accurate pitch perception of harmonic complex tones is widely believed to rely on temporal fine structure information conveyed by the precise phase-locked responses of auditory-nerve fibers. However, accurate pitch perception remains possible even when spectrally resolved harmonics are presented at frequencies beyond the putative limits of neural phase locking, and it is unclear whether residual temporal information, or a coarser rate-place code, underlies this ability. We addressed this question by measuring human pitch discrimination at low and high frequencies for harmonic complex tones, presented either in isolation or in the presence of concurrent complex-tone maskers. We found that concurrent complex-tone maskers impaired performance at both low and high frequencies, although the impairment introduced by adding maskers at high frequencies relative to low frequencies differed between the tested masker types. We then combined simulated auditory-nerve responses to our stimuli with ideal-observer analysis to quantify the extent to which performance was limited by peripheral factors. We found that the worsening of both frequency discrimination and F0 discrimination at high frequencies could be well accounted for (in relative terms) by optimal decoding of all available information at the level of the auditory nerve. A Python package is provided to reproduce these results, and to simulate responses to acoustic stimuli from the three previously published models of the human auditory nerve used in our analyses.
Collapse
Affiliation(s)
- Daniel R. Guest
- Department of Psychology, University of Minnesota, Minneapolis, Minnesota, United States of America
| | - Andrew J. Oxenham
- Department of Psychology, University of Minnesota, Minneapolis, Minnesota, United States of America
| |
Collapse
|
26
|
Luberadzka J, Kayser H, Hohmann V. Making sense of periodicity glimpses in a prediction-update-loop-A computational model of attentive voice tracking. THE JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA 2022; 151:712. [PMID: 35232067 PMCID: PMC9088677 DOI: 10.1121/10.0009337] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 06/15/2021] [Revised: 11/13/2021] [Accepted: 01/03/2022] [Indexed: 06/14/2023]
Abstract
Humans are able to follow a speaker even in challenging acoustic conditions. The perceptual mechanisms underlying this ability remain unclear. A computational model of attentive voice tracking, consisting of four computational blocks: (1) sparse periodicity-based auditory features (sPAF) extraction, (2) foreground-background segregation, (3) state estimation, and (4) top-down knowledge, is presented. The model connects the theories about auditory glimpses, foreground-background segregation, and Bayesian inference. It is implemented with the sPAF, sequential Monte Carlo sampling, and probabilistic voice models. The model is evaluated by comparing it with the human data obtained in the study by Woods and McDermott [Curr. Biol. 25(17), 2238-2246 (2015)], which measured the ability to track one of two competing voices with time-varying parameters [fundamental frequency (F0) and formants (F1,F2)]. Three model versions were tested, which differ in the type of information used for the segregation: version (a) uses the oracle F0, version (b) uses the estimated F0, and version (c) uses the spectral shape derived from the estimated F0 and oracle F1 and F2. Version (a) simulates the optimal human performance in conditions with the largest separation between the voices, version (b) simulates the conditions in which the separation is not sufficient to follow the voices, and version (c) is closest to the human performance for moderate voice separation.
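The state-estimation block relies on sequential Monte Carlo sampling. A generic bootstrap particle-filter step (predict, weight, resample) applied to a toy one-dimensional F0 track is sketched below; all distributions and parameters are invented for illustration and do not reproduce the sPAF features or voice models.

```python
import numpy as np

rng = np.random.default_rng(3)
n_steps, n_particles = 100, 500

# Toy ground truth: an F0 contour that drifts slowly around 200 Hz.
true_f0 = 200 + np.cumsum(rng.normal(0, 1.0, n_steps))
# Noisy "observations" standing in for per-frame periodicity estimates.
obs = true_f0 + rng.normal(0, 5.0, n_steps)

particles = rng.uniform(100, 300, n_particles)   # initial F0 hypotheses
estimates = []
for z in obs:
    # Predict: random-walk dynamics on F0.
    particles = particles + rng.normal(0, 2.0, n_particles)
    # Update: weight particles by the observation likelihood (Gaussian, sd 5 Hz).
    w = np.exp(-0.5 * ((z - particles) / 5.0) ** 2)
    w /= w.sum()
    estimates.append(np.sum(w * particles))
    # Resample: draw particles in proportion to their weights.
    idx = rng.choice(n_particles, size=n_particles, p=w)
    particles = particles[idx]

print(f"mean tracking error: {np.mean(np.abs(np.array(estimates) - true_f0)):.1f} Hz")
```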
Collapse
Affiliation(s)
- Joanna Luberadzka
- Auditory Signal Processing, Department of Medical Physics and Acoustics, University of Oldenburg, Germany
| | - Hendrik Kayser
- Auditory Signal Processing, Department of Medical Physics and Acoustics, University of Oldenburg, Germany
| | - Volker Hohmann
- Auditory Signal Processing, Department of Medical Physics and Acoustics, University of Oldenburg, Germany
| |
Collapse
|
27
|
Homma NY, Bajo VM. Lemniscal Corticothalamic Feedback in Auditory Scene Analysis. Front Neurosci 2021; 15:723893. [PMID: 34489635 PMCID: PMC8417129 DOI: 10.3389/fnins.2021.723893] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/11/2021] [Accepted: 07/30/2021] [Indexed: 12/15/2022] Open
Abstract
Sound information is transmitted from the ear to central auditory stations of the brain via several nuclei. In addition to these ascending pathways there exist descending projections that can influence the information processing at each of these nuclei. A major descending pathway in the auditory system is the feedback projection from layer VI of the primary auditory cortex (A1) to the ventral division of medial geniculate body (MGBv) in the thalamus. The corticothalamic axons have small glutamatergic terminals that can modulate thalamic processing and thalamocortical information transmission. Corticothalamic neurons also provide input to GABAergic neurons of the thalamic reticular nucleus (TRN) that receives collaterals from the ascending thalamic axons. The balance of corticothalamic and TRN inputs has been shown to refine frequency tuning, firing patterns, and gating of MGBv neurons. Therefore, the thalamus is not merely a relay stage in the chain of auditory nuclei but does participate in complex aspects of sound processing that include top-down modulations. In this review, we aim (i) to examine how lemniscal corticothalamic feedback modulates responses in MGBv neurons, and (ii) to explore how the feedback contributes to auditory scene analysis, particularly on frequency and harmonic perception. Finally, we will discuss potential implications of the role of corticothalamic feedback in music and speech perception, where precise spectral and temporal processing is essential.
Collapse
Affiliation(s)
- Natsumi Y. Homma
- Center for Integrative Neuroscience, University of California, San Francisco, San Francisco, CA, United States
- Coleman Memorial Laboratory, Department of Otolaryngology – Head and Neck Surgery, University of California, San Francisco, San Francisco, CA, United States
| | - Victoria M. Bajo
- Department of Physiology, Anatomy and Genetics, University of Oxford, Oxford, United Kingdom
| |
Collapse
|
28
|
No interaction between fundamental-frequency differences and spectral region when perceiving speech in a speech background. PLoS One 2021; 16:e0249654. [PMID: 33826663 PMCID: PMC8026035 DOI: 10.1371/journal.pone.0249654] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/13/2020] [Accepted: 03/22/2021] [Indexed: 02/06/2023] Open
Abstract
Differences in fundamental frequency (F0) or pitch between competing voices facilitate our ability to segregate a target voice from interferers, thereby enhancing speech intelligibility. Although lower-numbered harmonics elicit a stronger and more accurate pitch sensation than higher-numbered harmonics, it is unclear whether the stronger pitch leads to an increased benefit of pitch differences when segregating competing talkers. To answer this question, sentence recognition was tested in young normal-hearing listeners in the presence of a single competing talker. The stimuli were presented in a broadband condition or were highpass or lowpass filtered to manipulate the pitch accuracy of the voicing, while maintaining roughly equal speech intelligibility in the highpass and lowpass regions. Performance was measured with average F0 differences (ΔF0) between the target and single-talker masker of 0, 2, and 4 semitones. Pitch discrimination abilities were also measured to confirm that the lowpass-filtered stimuli elicited greater pitch accuracy than the highpass-filtered stimuli. No interaction was found between filter type and ΔF0 in the sentence recognition task, suggesting little or no effect of harmonic rank or pitch accuracy on the ability to use F0 to segregate natural voices, even when the average ΔF0 is relatively small. The results suggest that listeners are able to obtain some benefit of pitch differences between competing voices, even when pitch salience and accuracy are low.
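For reference, the ΔF0 manipulation is a simple semitone scaling of the fundamental, and the spectral-region manipulation amounts to lowpass or highpass filtering. The sketch below shows both, with an illustrative cutoff frequency and filter order rather than the study's values.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def shift_f0_semitones(f0_hz, semitones):
    """Shift a fundamental frequency by a number of semitones (12 per octave)."""
    return f0_hz * 2.0 ** (semitones / 12.0)

target_f0 = 220.0
for d in (0, 2, 4):                       # the three average-dF0 conditions
    print(d, "semitones ->", round(shift_f0_semitones(target_f0, d), 1), "Hz")

# Spectral-region manipulation: keep only low- or high-frequency speech energy.
fs = 16000
cutoff_hz = 1500.0                        # illustrative split frequency
sos_low = butter(4, cutoff_hz, btype="lowpass", fs=fs, output="sos")
sos_high = butter(4, cutoff_hz, btype="highpass", fs=fs, output="sos")

speech = np.random.default_rng(4).standard_normal(fs)   # placeholder for a recorded sentence
lowpass_speech = sosfiltfilt(sos_low, speech)
highpass_speech = sosfiltfilt(sos_high, speech)
```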
Collapse
|
29
|
Demany L, Monteiro G, Semal C, Shamma S, Carlyon RP. The perception of octave pitch affinity and harmonic fusion have a common origin. Hear Res 2021; 404:108213. [PMID: 33662686 PMCID: PMC7614450 DOI: 10.1016/j.heares.2021.108213] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 10/20/2020] [Revised: 02/05/2021] [Accepted: 02/10/2021] [Indexed: 02/06/2023]
Abstract
Musicians say that the pitches of tones with a frequency ratio of 2:1 (one octave) have a distinctive affinity, even if the tones do not have common spectral components. It has been suggested, however, that this affinity judgment has no biological basis and originates instead from an acculturation process: the learning of musical rules unrelated to auditory physiology. We measured, in young amateur musicians, the perceptual detectability of octave mistunings for tones presented alternately (melodic condition) or simultaneously (harmonic condition). In the melodic condition, mistuning was detectable only by means of explicit pitch comparisons. In the harmonic condition, listeners could use a different and more efficient perceptual cue: in the absence of mistuning, the tones fused into a single sound percept; mistunings decreased fusion. Performance was globally better in the harmonic condition, in line with the hypothesis that listeners used a fusion cue in this condition; this hypothesis was also supported by results showing that an illusory simultaneity of the tones was much less advantageous than a real simultaneity. In the two conditions, mistuning detection was generally better for octave compressions than for octave stretchings. This asymmetry varied across listeners, but, crucially, the listener-specific asymmetries observed in the two conditions were highly correlated. Thus, the perception of the melodic octave appeared to be closely linked to the phenomenon of harmonic fusion. As harmonic fusion is thought to be determined by biological factors rather than factors related to musical culture or training, we argue that octave pitch affinity also has, at least in part, a biological basis.
Collapse
Affiliation(s)
- Laurent Demany
- Institut de Neurosciences Cognitives et Intégratives d'Aquitaine, CNRS, EPHE, and Université de Bordeaux, Bordeaux, France.
| | - Guilherme Monteiro
- Institut de Neurosciences Cognitives et Intégratives d'Aquitaine, CNRS, EPHE, and Université de Bordeaux, Bordeaux, France
| | - Catherine Semal
- Institut de Neurosciences Cognitives et Intégratives d'Aquitaine, CNRS, EPHE, and Université de Bordeaux, Bordeaux, France; Bordeaux INP, Bordeaux, France.
| | - Shihab Shamma
- Institute for Systems Research, University of Maryland, College Park, MD, United States; Département d'Etudes Cognitives, Ecole Normale Supérieure, Paris, France.
| | - Robert P Carlyon
- Cambridge Hearing Group, MRC Cognition and Brain Sciences Unit, Cambridge, United Kingdom.
| |
Collapse
|
30
|
de Cheveigné A. Harmonic Cancellation-A Fundamental of Auditory Scene Analysis. Trends Hear 2021; 25:23312165211041422. [PMID: 34698574 PMCID: PMC8552394 DOI: 10.1177/23312165211041422] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/24/2021] [Revised: 07/23/2021] [Accepted: 07/09/2021] [Indexed: 11/16/2022] Open
Abstract
This paper reviews the hypothesis of harmonic cancellation according to which an interfering sound is suppressed or canceled on the basis of its harmonicity (or periodicity in the time domain) for the purpose of Auditory Scene Analysis. It defines the concept, discusses theoretical arguments in its favor, and reviews experimental results that support it, or not. If correct, the hypothesis may draw on time-domain processing of temporally accurate neural representations within the brainstem, as required also by the classic equalization-cancellation model of binaural unmasking. The hypothesis predicts that a target sound corrupted by interference will be easier to hear if the interference is harmonic than inharmonic, all else being equal. This prediction is borne out in a number of behavioral studies, but not all. The paper reviews those results, with the aim to understand the inconsistencies and come up with a reliable conclusion for, or against, the hypothesis of harmonic cancellation within the auditory system.
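The core operation under review, time-domain cancellation of a harmonic interferer, can be sketched as a delay-and-subtract comb filter tuned to the interferer's period. The sample rate, fundamental frequencies, and stimuli below are illustrative assumptions; any signal periodic at the filter's delay cancels, while other content passes (spectrally altered).

```python
import numpy as np

fs = 16000
t = np.arange(fs) / fs                       # 1 s of signal

def harmonic_complex(f0, n_harm=10):
    return sum(np.sin(2 * np.pi * f0 * k * t) for k in range(1, n_harm + 1))

target = harmonic_complex(137.0)             # F0 chosen so its period differs from the filter delay
interferer = harmonic_complex(200.0)         # harmonic interferer to be cancelled
mixture = target + interferer

# Delay-and-subtract comb filter tuned to the interferer's period (fs / F0 samples):
period = int(round(fs / 200.0))              # 80 samples
cancelled = mixture - np.roll(mixture, period)

def rms(x):
    return np.sqrt(np.mean(x ** 2))

print("interferer RMS before/after:", round(rms(interferer), 2),
      round(rms(interferer - np.roll(interferer, period)), 2))   # ~0: fully cancelled
print("target RMS before/after:    ", round(rms(target), 2),
      round(rms(target - np.roll(target, period)), 2))           # survives, spectrally altered
```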
Collapse
Affiliation(s)
- Alain de Cheveigné
- Laboratoire des systèmes perceptifs, CNRS, Paris, France
- Département d’études cognitives, École normale supérieure, PSL University, Paris, France
- UCL Ear Institute, London, UK
| |
Collapse
|
31
|
Abstract
Being able to pick out particular sounds, such as speech, against a background of other sounds represents one of the key tasks performed by the auditory system. Understanding how this happens is important because speech recognition in noise is particularly challenging for older listeners and for people with hearing impairments. Central to this ability is the capacity of neurons to adapt to the statistics of sounds reaching the ears, which helps to generate noise-tolerant representations of sounds in the brain. In more complex auditory scenes, such as a cocktail party, where the background noise comprises other voices, sound features associated with each source have to be grouped together and segregated from those belonging to other sources. This depends on precise temporal coding and modulation of cortical response properties when attending to a particular speaker in a multi-talker environment. Furthermore, the neural processing underlying auditory scene analysis is shaped by experience over multiple timescales.
Collapse
|
32
|
Bidelman GM, Yoo J. Musicians Show Improved Speech Segregation in Competitive, Multi-Talker Cocktail Party Scenarios. Front Psychol 2020; 11:1927. [PMID: 32973610 PMCID: PMC7461890 DOI: 10.3389/fpsyg.2020.01927] [Citation(s) in RCA: 26] [Impact Index Per Article: 5.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/10/2020] [Accepted: 07/13/2020] [Indexed: 12/05/2022] Open
Abstract
Studies suggest that long-term music experience enhances the brain’s ability to segregate speech from noise. Musicians’ “speech-in-noise (SIN) benefit” is based largely on perception in simple figure-ground tasks rather than in competitive, multi-talker scenarios that offer realistic spatial cues for segregation and engage binaural processing. We aimed to investigate whether musicians show perceptual advantages in cocktail party speech segregation in a competitive, multi-talker environment. We used the coordinate response measure (CRM) paradigm to measure speech recognition and localization performance in musicians vs. non-musicians in a simulated 3D cocktail party environment conducted in an anechoic chamber. Speech was delivered through a 16-channel speaker array distributed around the horizontal soundfield surrounding the listener. Participants recalled the color, number, and perceived location of target callsign sentences. We manipulated task difficulty by varying the number of additional maskers presented at other spatial locations in the horizontal soundfield (0–1–2–3–4–6–8 multi-talkers). Musicians obtained faster and better speech recognition amid up to eight simultaneous talkers and showed less noise-related decline in performance with increasing interferers than their non-musician peers. Correlations revealed associations between listeners’ years of musical training and CRM recognition and working memory. However, better working memory correlated with better speech streaming. Basic (QuickSIN) but not more complex (speech streaming) SIN processing was still predicted by music training after controlling for working memory. Our findings confirm a relationship between musicianship and naturalistic cocktail party speech streaming but also suggest that cognitive factors at least partially drive musicians’ SIN advantage.
Collapse
Affiliation(s)
- Gavin M Bidelman
- Institute for Intelligent Systems, University of Memphis, Memphis, TN, United States
- School of Communication Sciences and Disorders, University of Memphis, Memphis, TN, United States
- Department of Anatomy and Neurobiology, University of Tennessee Health Sciences Center, Memphis, TN, United States
| | - Jessica Yoo
- School of Communication Sciences and Disorders, University of Memphis, Memphis, TN, United States
| |
Collapse
|
33
|
Perceptual fusion of musical notes by native Amazonians suggests universal representations of musical intervals. Nat Commun 2020; 11:2786. [PMID: 32493923 PMCID: PMC7270137 DOI: 10.1038/s41467-020-16448-6] [Citation(s) in RCA: 28] [Impact Index Per Article: 5.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/12/2019] [Accepted: 04/23/2020] [Indexed: 01/31/2023] Open
Abstract
Music perception is plausibly constrained by universal perceptual mechanisms adapted to natural sounds. Such constraints could arise from our dependence on harmonic frequency spectra for segregating concurrent sounds, but evidence has been circumstantial. We measured the extent to which concurrent musical notes are misperceived as a single sound, testing Westerners as well as native Amazonians with limited exposure to Western music. Both groups were more likely to mistake note combinations related by simple integer ratios as single sounds (‘fusion’). Thus, even with little exposure to Western harmony, acoustic constraints on sound segregation appear to induce perceptual structure on note combinations. However, fusion did not predict aesthetic judgments of intervals in Westerners, or in Amazonians, who were indifferent to consonance/dissonance. The results suggest universal perceptual mechanisms that could help explain cross-cultural regularities in musical systems, but indicate that these mechanisms interact with culture-specific influences to produce musical phenomena such as consonance. Music varies across cultures, but some features are widespread, consistent with biological constraints. Here, the authors report that both Western and native Amazonian listeners perceptually fuse concurrent notes related by simple-integer ratios, suggestive of one such biological constraint.
Collapse
|
34
|
Auditory Selectivity for Spectral Contrast in Cortical Neurons and Behavior. J Neurosci 2019; 40:1015-1027. [PMID: 31826944 DOI: 10.1523/jneurosci.1200-19.2019] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/23/2019] [Revised: 12/04/2019] [Accepted: 12/06/2019] [Indexed: 12/17/2022] Open
Abstract
Vocal communication relies on the ability of listeners to identify, process, and respond to vocal sounds produced by others in complex environments. To accurately recognize these signals, animals' auditory systems must robustly represent acoustic features that distinguish vocal sounds from other environmental sounds. Vocalizations typically have spectral structure; power regularly fluctuates along the frequency axis, creating spectral contrast. Spectral contrast is closely related to harmonicity, which refers to spectral power peaks occurring at integer multiples of a fundamental frequency. Although both spectral contrast and harmonicity typify natural sounds, they may differ in salience for communication behavior and engage distinct neural mechanisms. Therefore, it is important to understand which of these properties of vocal sounds underlie the neural processing and perception of vocalizations. Here, we test the importance of vocalization-typical spectral features in behavioral recognition and neural processing of vocal sounds, using male zebra finches. We show that behavioral responses to natural and synthesized vocalizations rely on the presence of discrete frequency components, but not on harmonic ratios between frequencies. We identify a specific population of neurons in primary auditory cortex that are sensitive to the spectral resolution of vocal sounds. We find that behavioral and neural response selectivity is explained by sensitivity to spectral contrast rather than harmonicity. This selectivity emerges within the cortex; it is absent in the thalamorecipient region and present in the deep output region. Further, deep-region neurons that are contrast-sensitive show distinct temporal responses and selectivity for modulation density compared with unselective neurons. SIGNIFICANCE STATEMENT Auditory coding and perception are critical for vocal communication. Auditory neurons must encode acoustic features that distinguish vocalizations from other sounds in the environment and generate percepts that direct behavior. The acoustic features that drive neural and behavioral selectivity for vocal sounds are unknown, however. Here, we show that vocal response behavior scales with stimulus spectral contrast but not with harmonicity, in songbirds. We identify a distinct population of auditory cortex neurons in which response selectivity parallels behavioral selectivity. This neural response selectivity is explained by sensitivity to spectral contrast rather than to harmonicity. Our findings inform the understanding of how the auditory system encodes socially-relevant signals via detection of an acoustic feature that is ubiquitous in vocalizations.
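One simple way to quantify the spectral contrast described here, the depth of power fluctuations along the frequency axis, is the peak-to-floor range of the log-power spectrum within an analysis band. The sketch below illustrates that idea and is not the authors' analysis; the band limits and contrast index are assumptions.

```python
import numpy as np

fs = 32000
t = np.arange(int(0.2 * fs)) / fs
rng = np.random.default_rng(6)

harmonic = sum(np.sin(2 * np.pi * 500 * k * t) for k in range(1, 9))   # strong spectral peaks
noise = rng.standard_normal(len(t))                                    # flat spectrum

def spectral_contrast_db(x, fmin=300, fmax=5000):
    """Peak-to-floor range of the log-power spectrum in a band (a simple contrast index)."""
    spectrum = np.abs(np.fft.rfft(x * np.hanning(len(x)))) ** 2
    freqs = np.fft.rfftfreq(len(x), 1 / fs)
    band_db = 10 * np.log10(spectrum[(freqs >= fmin) & (freqs <= fmax)] + 1e-12)
    return band_db.max() - np.median(band_db)   # peak relative to the spectral "floor"

print(round(spectral_contrast_db(harmonic), 1), "dB (harmonic complex)")
print(round(spectral_contrast_db(noise), 1), "dB (white noise)")
```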
Collapse
|
35
|
Młynarski W, McDermott JH. Ecological origins of perceptual grouping principles in the auditory system. Proc Natl Acad Sci U S A 2019; 116:25355-25364. [PMID: 31754035 PMCID: PMC6911196 DOI: 10.1073/pnas.1903887116] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022] Open
Abstract
Events and objects in the world must be inferred from sensory signals to support behavior. Because sensory measurements are temporally and spatially local, the estimation of an object or event can be viewed as the grouping of these measurements into representations of their common causes. Perceptual grouping is believed to reflect internalized regularities of the natural environment, yet grouping cues have traditionally been identified using informal observation and investigated using artificial stimuli. The relationship of grouping to natural signal statistics has thus remained unclear, and additional or alternative cues remain possible. Here, we develop a general methodology for relating grouping to natural sensory signals and apply it to derive auditory grouping cues from natural sounds. We first learned local spectrotemporal features from natural sounds and measured their co-occurrence statistics. We then learned a small set of stimulus properties that could predict the measured feature co-occurrences. The resulting cues included established grouping cues, such as harmonic frequency relationships and temporal coincidence, but also revealed previously unappreciated grouping principles. Human perceptual grouping was predicted by natural feature co-occurrence, with humans relying on the derived grouping cues in proportion to their informativity about co-occurrence in natural sounds. The results suggest that auditory grouping is adapted to natural stimulus statistics, show how these statistics can reveal previously unappreciated grouping phenomena, and provide a framework for studying grouping in natural signals.
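A schematic of the first two steps described above, assuming random templates in place of features learned from natural sounds: compute local spectrotemporal feature activations on a spectrogram-like input, then measure their co-occurrence as correlations between activation maps.

```python
import numpy as np
from scipy.signal import correlate2d

rng = np.random.default_rng(7)

# Stand-in for a log-spectrogram (frequency x time); real analyses would use natural sounds.
spec = rng.standard_normal((32, 200))

# Stand-in for learned local spectrotemporal features: small frequency-time templates.
n_features, fdim, tdim = 10, 6, 4
features = rng.standard_normal((n_features, fdim, tdim))

# Feature activations: correlate each template with every spectrogram patch.
activations = np.stack([correlate2d(spec, w, mode="valid") for w in features])

# Co-occurrence statistics: correlations between feature activations across patches.
flat = activations.reshape(n_features, -1)
cooccurrence = np.corrcoef(flat)
print(cooccurrence.shape)   # (10, 10): how strongly pairs of features are active together
```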
Collapse
Affiliation(s)
- Wiktor Młynarski
- Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA 02139;
- Center for Brains, Minds and Machines, Massachusetts Institute of Technology, Cambridge, MA 02139
| | - Josh H McDermott
- Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA 02139;
- Center for Brains, Minds and Machines, Massachusetts Institute of Technology, Cambridge, MA 02139
- McGovern Institute for Brain Research, Massachusetts Institute of Technology, Cambridge, MA 02139
- Program in Speech and Hearing Biosciences and Technology, Harvard University, Boston, MA 02115
| |
Collapse
|
36
|
Norman-Haignere SV, Kanwisher N, McDermott JH, Conway BR. Divergence in the functional organization of human and macaque auditory cortex revealed by fMRI responses to harmonic tones. Nat Neurosci 2019; 22:1057-1060. [PMID: 31182868 PMCID: PMC6592717 DOI: 10.1038/s41593-019-0410-7] [Citation(s) in RCA: 29] [Impact Index Per Article: 4.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/10/2018] [Accepted: 04/19/2019] [Indexed: 12/02/2022]
Abstract
We report a difference between humans and macaque monkeys in the functional organization of cortical regions implicated in pitch perception: humans but not macaques showed regions with a strong preference for harmonic sounds compared to noise, measured with both synthetic tones and macaque vocalizations. In contrast, frequency-selective tonotopic maps were similar between the two species. This species difference may be driven by the unique demands of speech and music perception in humans.
Collapse
Affiliation(s)
- Sam V Norman-Haignere
- Zuckerman Institute for Mind, Brain and Behavior, Columbia University, New York, NY, USA
- Department of Brain and Cognitive Sciences, MIT, Cambridge, MA, USA
- HHMI Postdoctoral Fellow of the Life Sciences Research Institute, Chevy Chase, MD, USA
- Laboratoire des Systèmes Perceptifs, Département d'Études Cognitives, École Normale Supérieure, PSL University, CNRS, Paris, France
| | - Nancy Kanwisher
- Department of Brain and Cognitive Sciences, MIT, Cambridge, MA, USA
- McGovern Institute for Brain Research, Cambridge, MA, USA
- Center for Minds, Brains and Machines, Cambridge, MA, USA
| | - Josh H McDermott
- Department of Brain and Cognitive Sciences, MIT, Cambridge, MA, USA
- McGovern Institute for Brain Research, Cambridge, MA, USA
- Center for Minds, Brains and Machines, Cambridge, MA, USA
- Program in Speech and Hearing Biosciences and Technology, Harvard University, Cambridge, MA, USA
| | - Bevil R Conway
- Laboratory of Sensorimotor Research, NEI, NIH, Bethesda, MD, USA
- National Institute of Mental Health, NIH, Bethesda, MD, USA
- National Institute of Neurological Disease and Stroke, NIH, Bethesda, MD, USA
| |
Collapse
|
37
|
Walker KM, Gonzalez R, Kang JZ, McDermott JH, King AJ. Across-species differences in pitch perception are consistent with differences in cochlear filtering. eLife 2019; 8:41626. [PMID: 30874501 PMCID: PMC6435318 DOI: 10.7554/elife.41626] [Citation(s) in RCA: 20] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/01/2018] [Accepted: 03/14/2019] [Indexed: 11/13/2022] Open
Abstract
Pitch perception is critical for recognizing speech, music and animal vocalizations, but its neurobiological basis remains unsettled, in part because of divergent results across species. We investigated whether species-specific differences exist in the cues used to perceive pitch and whether these can be accounted for by differences in the auditory periphery. Ferrets accurately generalized pitch discriminations to untrained stimuli whenever temporal envelope cues were robust in the probe sounds, but not when resolved harmonics were the main available cue. By contrast, human listeners exhibited the opposite pattern of results on an analogous task, consistent with previous studies. Simulated cochlear responses in the two species suggest that differences in the relative salience of the two pitch cues can be attributed to differences in cochlear filter bandwidths. The results support the view that cross-species variation in pitch perception reflects the constraints of estimating a sound’s fundamental frequency given species-specific cochlear tuning.
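The peripheral argument can be illustrated with the standard human auditory-filter bandwidth formula (Glasberg and Moore, 1990) and a rough harmonic-resolvability count; the threefold filter broadening used for the comparison below is an illustrative assumption, not a measured ferret value.

```python
import numpy as np

def erb_hz(f_hz):
    """Equivalent rectangular bandwidth of the human auditory filter (Glasberg & Moore, 1990)."""
    return 24.7 * (4.37 * f_hz / 1000.0 + 1.0)

def resolved_harmonics(f0, broadening=1.0):
    """Rough count of harmonics whose spacing (= F0) exceeds the local filter bandwidth.
    `broadening` scales the bandwidths to mimic a species with wider filters (illustrative)."""
    harmonics = f0 * np.arange(1, 31)
    return int(np.sum(f0 > erb_hz(harmonics) * broadening))

f0 = 200.0
print("human-like filters:", resolved_harmonics(f0), "resolved harmonics")
print("3x broader filters:", resolved_harmonics(f0, broadening=3.0), "resolved harmonics")
```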
Collapse
Affiliation(s)
- Kerry MM Walker
- Department of Physiology, Anatomy & Genetics, University of Oxford, Oxford, United Kingdom
| | - Ray Gonzalez
- Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, United States
| | - Joe Z Kang
- Department of Physiology, Anatomy & Genetics, University of Oxford, Oxford, United Kingdom
| | - Josh H McDermott
- Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, United States
- Program in Speech and Hearing Biosciences and Technology, Harvard University, Cambridge, United States
| | - Andrew J King
- Department of Physiology, Anatomy & Genetics, University of Oxford, Oxford, United Kingdom
| |
Collapse
|
38
|
Hamilton LS, Huth AG. The revolution will not be controlled: natural stimuli in speech neuroscience. LANGUAGE, COGNITION AND NEUROSCIENCE 2018; 35:573-582. [PMID: 32656294 PMCID: PMC7324135 DOI: 10.1080/23273798.2018.1499946] [Citation(s) in RCA: 131] [Impact Index Per Article: 18.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Subscribe] [Scholar Register] [Received: 02/21/2018] [Accepted: 07/03/2018] [Indexed: 05/22/2023]
Abstract
Humans have a unique ability to produce and consume rich, complex, and varied language in order to communicate ideas to one another. Still, outside of natural reading, the most common methods for studying how our brains process speech or understand language use only isolated words or simple sentences. Recent studies have upset this status quo by employing complex natural stimuli and measuring how the brain responds to language as it is used. In this article we argue that natural stimuli offer many advantages over simplified, controlled stimuli for studying how language is processed by the brain. Furthermore, the downsides of using natural language stimuli can be mitigated using modern statistical and computational techniques.
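One of the statistical techniques commonly used to mitigate the downsides of natural stimuli is a regularized (ridge) encoding model, which predicts a response channel from many correlated stimulus features. The sketch below uses synthetic data and an illustrative penalty value; it is a generic example, not the authors' pipeline.

```python
import numpy as np

rng = np.random.default_rng(8)

# Synthetic stand-ins: stimulus features (e.g., spectral or semantic) and a response channel.
n_time, n_features = 2000, 100
X = rng.standard_normal((n_time, n_features))
X[:, 1] = X[:, 0] + 0.1 * rng.standard_normal(n_time)    # natural features are often correlated
true_w = rng.standard_normal(n_features)
y = X @ true_w + 2.0 * rng.standard_normal(n_time)        # noisy "neural" response

# Ridge regression: the penalty stabilizes weights despite feature correlations.
lam = 10.0                                                 # illustrative regularization strength
w_hat = np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Evaluate on held-out data rather than the fitting data.
X_test = rng.standard_normal((500, n_features))
y_test = X_test @ true_w + 2.0 * rng.standard_normal(500)
pred = X_test @ w_hat
print(f"held-out prediction correlation: {np.corrcoef(pred, y_test)[0, 1]:.2f}")
```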
Collapse
Affiliation(s)
- Liberty S. Hamilton
- Communication Sciences & Disorders, Moody College of Communication, The University of Texas at Austin, Austin, USA
- Department of Neurology, Dell Medical School, The University of Texas at Austin, Austin, USA
| | - Alexander G. Huth
- Department of Neuroscience, The University of Texas at Austin, Austin, USA
- Department of Computer Science, The University of Texas at Austin, Austin, USA
| |
Collapse
|