1
Kong F, Zhou H, Zheng N, Meng Q. Sparse representation of speech using an atomic speech model. The Journal of the Acoustical Society of America 2025; 157:1899-1911. [PMID: 40106275 DOI: 10.1121/10.0036144] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/26/2024] [Accepted: 02/20/2025] [Indexed: 03/22/2025]
Abstract
Speech perception has been extensively studied using degradation algorithms such as channel vocoding, mosaic speech, and pointillistic speech. Here, an "atomic speech model" is introduced to generate unique sparse time-frequency patterns. It processes speech signals using a bank of bandpass filters, undersamples the signals, and reproduces each sample using a Gaussian-enveloped tone (a Gabor atom). To examine atomic speech intelligibility, adaptive speech reception thresholds (SRTs) are measured as a function of atom rate in normal-hearing listeners, investigating the effects of spectral maxima, binaural integration, and a single echo. Experiment 1 showed that atomic speech with 4 spectral maxima out of 32 bands remained intelligible even at rates below 80 atoms per second. Experiment 2 showed that when atoms were assigned nonoverlappingly to the two ears, the mean SRT increased (i.e., worsened) compared with the monaural condition, in which all atoms were assigned to one ear. Individual data revealed that a few listeners could integrate information across the ears, performing comparably to the monaural condition. Experiment 3 indicated a higher mean SRT with a 100 ms echo delay than with shorter delays (50, 25, and 0 ms). These findings demonstrate the utility of the atomic speech model for investigating speech perception and its underlying mechanisms.
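The synthesis idea described above — bandpass filtering, undersampling, and replacing each retained sample with a Gaussian-enveloped tone — can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation; the filter bandwidths, atom duration, and the `atom_rate` and `center_freqs` parameters below are assumptions chosen for demonstration only.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def gabor_atom(fs, f0, amp, dur=0.02):
    """Gaussian-enveloped tone (Gabor atom): a brief sinusoid at f0 Hz."""
    t = np.arange(int(dur * fs)) / fs
    sigma = dur / 6.0                      # envelope width; atom decays to ~0 at its edges
    env = np.exp(-0.5 * ((t - dur / 2) / sigma) ** 2)
    return amp * env * np.sin(2 * np.pi * f0 * t)

def atomic_speech(x, fs, center_freqs, atom_rate=80, atom_dur=0.02):
    """Crude sketch: bandpass-filter the input, undersample each band,
    and resynthesize every retained sample as a Gabor atom."""
    y = np.zeros(len(x))
    hop = int(fs / atom_rate)              # one candidate atom per band per hop
    for f0 in center_freqs:                # half-octave-wide bands (an assumption)
        sos = butter(4, [f0 * 2 ** -0.25, f0 * 2 ** 0.25],
                     btype="bandpass", fs=fs, output="sos")
        band = sosfiltfilt(sos, x)
        for start in range(0, len(x) - int(atom_dur * fs), hop):
            amp = np.abs(band[start])      # undersampled value reproduced by the atom
            atom = gabor_atom(fs, f0, amp, atom_dur)
            y[start:start + len(atom)] += atom
    return y
```

Spectral-maxima selection (keeping only the strongest bands per frame) and level calibration, which the study manipulates, are omitted here for brevity.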
Affiliation(s)
- Fanhui Kong
- School of Information Engineering, Guangzhou Panyu Polytechnic, Guangzhou, Guangdong 410630, China
- Guangdong Key Laboratory of Intelligent Information Processing, College of Electronics and Information Engineering, Shenzhen University, Shenzhen, Guangdong 518052, China
- Huali Zhou
- School of Electronics and Information Engineering, Heyuan Polytechnic, Heyuan, Guangdong 517000, China
- Nengheng Zheng
- Guangdong Key Laboratory of Intelligent Information Processing, College of Electronics and Information Engineering, Shenzhen University, Shenzhen, Guangdong 518052, China
- Qinglin Meng
- Acoustics Laboratory, School of Physics and Optoelectronics, South China University of Technology, Guangzhou, Guangdong 510641, China
2
Swanborough H, Staib M, Frühholz S. Neurocognitive dynamics of near-threshold voice signal detection and affective voice evaluation. Science Advances 2020; 6(50):eabb3884. [PMID: 33310844 PMCID: PMC7732184 DOI: 10.1126/sciadv.abb3884] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 02/20/2020] [Accepted: 10/29/2020] [Indexed: 05/10/2023]
Abstract
Communication and voice signal detection in noisy environments are universal tasks for many species. The fundamental problem of detecting voice signals in noise (VIN) is underinvestigated, especially with respect to its temporal dynamics. We investigated VIN as a dynamic signal-to-noise ratio (SNR) problem to determine the neurocognitive dynamics of subthreshold evidence accrual and near-threshold voice signal detection. Experiment 1 showed that detection in dynamic VIN, which involves a varying SNR and subthreshold sensory evidence accrual, is superior to detection in similar conditions with nondynamic SNRs or with acoustically matched sounds. Furthermore, voice signals with affective meaning have a detection advantage during VIN. Experiment 2 demonstrated that VIN is driven by effective neural integration in an auditory cortical-limbic network at and beyond the near-threshold detection point, preceded by activity in subcortical auditory nuclei. This demonstrates a recognition advantage for communication signals in dynamic noise contexts, especially when they carry socio-affective meaning.
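For a concrete picture of the stimulus manipulation, the dynamic-SNR condition can be approximated by embedding a voice token in noise whose signal-to-noise ratio ramps upward over the trial. The sketch below is an assumption-laden illustration (linear dB ramp, arbitrary start and end SNRs), not the authors' stimulus-generation code.

```python
import numpy as np

def rms(x):
    return np.sqrt(np.mean(x ** 2))

def dynamic_snr_mix(voice, noise, snr_start_db=-24.0, snr_end_db=-3.0):
    """Embed `voice` in `noise` with an SNR that ramps linearly in dB over time,
    so that sensory evidence accrues from subthreshold toward near-threshold levels."""
    n = len(voice)
    noise = noise[:n]
    snr_db = np.linspace(snr_start_db, snr_end_db, n)   # time-varying target SNR
    gain = 10 ** (snr_db / 20) * rms(noise) / (rms(voice) + 1e-12)
    return gain * voice + noise
```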
Affiliation(s)
- Huw Swanborough
- Cognitive and Affective Neuroscience Unit, Department of Psychology, University of Zurich, Zurich, Switzerland.
- Neuroscience Center Zurich, University of Zurich and ETH Zurich, Zurich, Switzerland
- Matthias Staib
- Cognitive and Affective Neuroscience Unit, Department of Psychology, University of Zurich, Zurich, Switzerland
- Neuroscience Center Zurich, University of Zurich and ETH Zurich, Zurich, Switzerland
- Sascha Frühholz
- Cognitive and Affective Neuroscience Unit, Department of Psychology, University of Zurich, Zurich, Switzerland
- Neuroscience Center Zurich, University of Zurich and ETH Zurich, Zurich, Switzerland
- Department of Psychology, University of Oslo, Oslo, Norway
3
Varnet L, Langlet C, Lorenzi C, Lazard DS, Micheyl C. High-Frequency Sensorineural Hearing Loss Alters Cue-Weighting Strategies for Discriminating Stop Consonants in Noise. Trends Hear 2020; 23:2331216519886707. [PMID: 31722636 PMCID: PMC6856982 DOI: 10.1177/2331216519886707] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Indexed: 01/01/2023]
Abstract
There is increasing evidence that hearing-impaired (HI) individuals do not use the same listening strategies as normal-hearing (NH) individuals, even when wearing optimally fitted hearing aids. In this perspective, better characterization of individual perceptual strategies is an important step toward designing more effective speech-processing algorithms. Here, we describe two complementary approaches for (a) revealing the acoustic cues used by a participant in a /d/-/g/ categorization task in noise and (b) measuring the relative contributions of these cues to the decision. Both approaches involve natural speech recordings altered by the addition of a “bump noise.” The bumps were narrowband bursts of noise localized at the spectrotemporal locations of the acoustic cues, allowing the experimenter to manipulate the consonant percept. Cue-weighting strategies were estimated for three groups of participants: 17 NH listeners, 18 HI listeners with high-frequency loss, and 15 HI listeners with flat loss. HI participants were provided with individual frequency-dependent amplification to compensate for their hearing loss. Although all listeners relied more heavily on the high-frequency cue than on the low-frequency cue, considerable variability was observed in the individual weights, mostly explained by differences in internal noise. Individuals with high-frequency loss relied slightly less heavily on the high-frequency cue relative to the low-frequency cue than NH individuals did, suggesting a possible influence of supra-threshold deficits on cue-weighting strategies. Altogether, these results suggest a need for individually tailored speech-in-noise processing in hearing aids if more effective speech discriminability in noise is to be achieved.
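The “bump noise” manipulation — a narrowband burst of noise placed at the spectrotemporal location of an acoustic cue — can be illustrated as follows. This is a schematic sketch, not the authors' stimulus code; the burst duration, bandwidth, and level are placeholder values.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def add_bump_noise(x, fs, t_center, f_center, dur=0.05, bandwidth=200.0, level_db=-10.0):
    """Add a narrowband noise burst ('bump') centered at t_center (s) and f_center (Hz)."""
    n = int(dur * fs)
    burst = sosfiltfilt(
        butter(4, [f_center - bandwidth / 2, f_center + bandwidth / 2],
               btype="bandpass", fs=fs, output="sos"),
        np.random.randn(n))
    burst *= np.hanning(n)                               # smooth onset and offset
    target_rms = np.sqrt(np.mean(x ** 2)) * 10 ** (level_db / 20)
    burst *= target_rms / (np.sqrt(np.mean(burst ** 2)) + 1e-12)
    y = x.astype(float)
    start = int(t_center * fs)
    seg = y[start:start + n]
    seg += burst[:len(seg)]                              # add burst in place, clipped at signal end
    return y
```

Placing such bumps on the cue regions identified for /d/ versus /g/ is what lets the experimenter push the percept toward one category or the other.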
Affiliation(s)
- Léo Varnet
- Laboratoire des systèmes perceptifs, Département d'études cognitives, École normale supérieure, Université Paris Sciences et Lettres, CNRS, Paris, France
- Chloé Langlet
- Laboratoire des systèmes perceptifs, Département d'études cognitives, École normale supérieure, Université Paris Sciences et Lettres, CNRS, Paris, France
- Christian Lorenzi
- Laboratoire des systèmes perceptifs, Département d'études cognitives, École normale supérieure, Université Paris Sciences et Lettres, CNRS, Paris, France
4
Cooke M, García Lecumberri ML. Sculpting speech from noise, music, and other sources. The Journal of the Acoustical Society of America 2020; 148:EL20. [PMID: 32752733 DOI: 10.1121/10.0001474] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/18/2020] [Accepted: 06/04/2020] [Indexed: 06/11/2023]
Abstract
Intelligible speech can be generated by passing a signal through a time-frequency mask that selects which information to retain, even when the signal is speech-shaped noise, suggesting an important role for the mask pattern itself. The current study examined the relationship between the signal and the mask by varying the availability of target speech cues in the signal while holding the mask constant. Keyword identification rates in everyday sentences varied from near-ceiling to near-floor levels as the signal was varied, indicating that the interaction between the signal and mask, rather than the mask alone, determines intelligibility.
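The masking operation described here can be sketched as a binary short-time Fourier transform (STFT) mask derived from a target utterance and applied to an arbitrary carrier (e.g., speech-shaped noise). The sketch below is only an illustration of the general idea; the STFT settings and the 20 dB selection threshold are assumptions, not the parameters used in the study.

```python
import numpy as np
from scipy.signal import stft, istft

def sculpt(carrier, target, fs, nperseg=512, threshold_db=20.0):
    """Keep carrier energy only in time-frequency cells where the target utterance
    has energy within `threshold_db` dB of its maximum; zero out the rest."""
    _, _, S_target = stft(target, fs=fs, nperseg=nperseg)
    _, _, S_carrier = stft(carrier, fs=fs, nperseg=nperseg)
    mag_db = 20 * np.log10(np.abs(S_target) + 1e-12)
    mask = mag_db > (mag_db.max() - threshold_db)        # binary time-frequency mask
    n = min(S_carrier.shape[1], mask.shape[1])           # align frame counts
    _, sculpted = istft(S_carrier[:, :n] * mask[:, :n], fs=fs, nperseg=nperseg)
    return sculpted
```

Varying what the carrier contains while keeping the mask fixed, the manipulation examined in this study, amounts to swapping `carrier` while leaving `target` (and hence the mask) unchanged.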
Affiliation(s)
- Martin Cooke
- Ikerbasque (Basque Science Foundation), Bilbao, Spain
5
Venezia JH, Martin AG, Hickok G, Richards VM. Identification of the Spectrotemporal Modulations That Support Speech Intelligibility in Hearing-Impaired and Normal-Hearing Listeners. Journal of Speech, Language, and Hearing Research 2019; 62:1051-1067. [PMID: 30986140 PMCID: PMC6802883 DOI: 10.1044/2018_jslhr-h-18-0045] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Indexed: 05/12/2023]
Abstract
Purpose Age-related sensorineural hearing loss can dramatically affect speech recognition performance due to reduced audibility and suprathreshold distortion of spectrotemporal information. Normal aging produces changes within the central auditory system that impose further distortions. The goal of this study was to characterize the effects of aging and hearing loss on perceptual representations of speech. Method We asked whether speech intelligibility is supported by different patterns of spectrotemporal modulations (STMs) in older listeners compared to young normal-hearing listeners. We recruited 3 groups of participants: 20 older hearing-impaired (OHI) listeners, 19 age-matched normal-hearing listeners, and 10 young normal-hearing (YNH) listeners. Listeners performed a speech recognition task in which randomly selected regions of the speech STM spectrum were revealed from trial to trial. The overall amount of STM information was varied using an up-down staircase to hold performance at 50% correct. Ordinal regression was used to estimate weights showing which regions of the STM spectrum were associated with good performance (a "classification image" or CImg). Results The results indicated that (a) large-scale CImg patterns did not differ between the 3 groups; (b) weights in a small region of the CImg decreased systematically as hearing loss increased; (c) CImgs were also nonsystematically distorted in OHI listeners, and the magnitude of this distortion predicted speech recognition performance even after accounting for audibility; and (d) YNH listeners performed better overall than the older groups. Conclusion We conclude that OHI/older normal-hearing listeners rely on the same speech STMs as YNH listeners but encode this information less efficiently. Supplemental Material https://doi.org/10.23641/asha.7859981.
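The weight-estimation step, relating which spectrotemporal modulation (STM) regions were revealed on each trial to whether the response was correct, can be illustrated with a simulated toy dataset. The sketch below uses ordinary binary logistic regression as a stand-in for the ordinal regression reported in the paper; trial counts, bin counts, and the simulated "true" weights are arbitrary assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_trials, n_bins = 2000, 64
revealed = rng.integers(0, 2, size=(n_trials, n_bins))     # which STM bins were audible per trial
true_weights = np.zeros(n_bins)
true_weights[20:28] = 1.5                                   # bins that actually support recognition
p_correct = 1 / (1 + np.exp(-(revealed @ true_weights - 4)))
correct = rng.random(n_trials) < p_correct                  # simulated trial-by-trial accuracy

# The fitted coefficients over STM bins play the role of the "classification image".
cimg = LogisticRegression(max_iter=1000).fit(revealed, correct).coef_.ravel()
print("Highest-weight STM bins:", np.argsort(cimg)[-8:])
```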
Affiliation(s)
- Jonathan H. Venezia
- VA Loma Linda Healthcare System, CA
- Department of Otolaryngology, School of Medicine, Loma Linda University, Loma Linda, CA
- Gregory Hickok
- Department of Cognitive Sciences, University of California, Irvine
6
Burred JJ, Ponsot E, Goupil L, Liuni M, Aucouturier JJ. CLEESE: An open-source audio-transformation toolbox for data-driven experiments in speech and music cognition. PLoS One 2019; 14:e0205943. [PMID: 30947281 PMCID: PMC6448843 DOI: 10.1371/journal.pone.0205943] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Received: 09/30/2018] [Accepted: 02/15/2019] [Indexed: 11/29/2022]
Abstract
Over the past few years, the field of visual social cognition and face processing has been dramatically impacted by a series of data-driven studies employing computer-graphics tools to synthesize arbitrary meaningful facial expressions. In the auditory modality, reverse correlation is traditionally used to characterize sensory processing at the level of spectral or spectro-temporal stimulus properties, but not higher-level cognitive processing of, e.g., words, sentences, or music, for lack of tools able to manipulate the stimulus dimensions that are relevant for these processes. Here, we present an open-source audio-transformation toolbox, called CLEESE, able to systematically randomize the prosody/melody of existing speech and music recordings. CLEESE works by cutting recordings into small successive time segments (e.g., every successive 100 milliseconds in a spoken utterance) and applying a random parametric transformation to each segment’s pitch, duration, or amplitude, using a new Python-language implementation of the phase-vocoder digital audio technique. We present two applications of the tool to generate stimuli for studying intonation processing of interrogative vs. declarative speech and rhythm processing of sung melodies.
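The core randomization idea — cutting a recording into short consecutive segments and drawing an independent random transformation value for each — can be sketched generically as below. This is not CLEESE's actual API or its phase-vocoder implementation (CLEESE additionally interpolates between breakpoints to avoid discontinuities); the segment duration and shift range are illustrative, and `librosa` is used here only as a convenient stand-in pitch-shifter.

```python
import numpy as np
import librosa

def random_pitch_profile(y, sr, seg_dur=0.1, max_semitones=2.0, seed=None):
    """Apply an independent random pitch shift to each successive segment of `y`,
    producing a randomized prosodic (pitch) contour."""
    rng = np.random.default_rng(seed)
    seg_len = int(seg_dur * sr)
    out = []
    for start in range(0, len(y), seg_len):
        seg = y[start:start + seg_len]
        shift = rng.uniform(-max_semitones, max_semitones)   # random breakpoint value, in semitones
        out.append(librosa.effects.pitch_shift(seg, sr=sr, n_steps=shift))
    return np.concatenate(out)
```

In a data-driven (reverse-correlation) experiment, many such randomized versions of one utterance are generated, and listeners' judgments are then related back to the random profiles to recover the prosodic contour driving the percept.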
Affiliation(s)
- Emmanuel Ponsot
- Science and Technology of Music and Sound (UMR9912, IRCAM/CNRS/Sorbonne Université), Paris, France
- Laboratoire des Systèmes Perceptifs (CNRS UMR 8248) and Département d’études cognitives, École Normale Supérieure, PSL Research University, Paris, France
- Louise Goupil
- Science and Technology of Music and Sound (UMR9912, IRCAM/CNRS/Sorbonne Université), Paris, France
- Marco Liuni
- Science and Technology of Music and Sound (UMR9912, IRCAM/CNRS/Sorbonne Université), Paris, France
- Jean-Julien Aucouturier
- Science and Technology of Music and Sound (UMR9912, IRCAM/CNRS/Sorbonne Université), Paris, France
7
Spille C, Kollmeier B, Meyer BT. Comparing human and automatic speech recognition in simple and complex acoustic scenes. Comput Speech Lang 2018. [DOI: 10.1016/j.csl.2018.04.003] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.1] [Indexed: 11/29/2022]