1. Shen J, Wu J. Recognition of Speech With Dynamic Pitch Manipulation in Noise: Effects of Manipulation Methods. Journal of Speech, Language, and Hearing Research. 2024;67:269-281. PMID: 37983169; PMCID: PMC11000783; DOI: 10.1044/2023_jslhr-23-00142.
Abstract
PURPOSE Dynamic pitch, which is defined as the variation in fundamental frequency in speech, is one of the acoustic cues that affect speech recognition in noise. Building on the evidence that a symmetrical manipulation of dynamic pitch led to poorer speech recognition, the present study examined the effect of an asymmetrical manipulation method on speech recognition in noise by younger and older adults. METHOD Speech recognition accuracy in noise was measured in younger adults with normal hearing in Experiment 1, and speech reception thresholds (in dB SNR) were measured in older adults with normal hearing to mild-moderate hearing loss in Experiment 2. The dynamic pitch contours of the speech stimuli were manipulated using both symmetrical and asymmetrical methods. RESULTS Younger adults recognized speech better in noise with asymmetrical than symmetrical manipulation, and with weakened than strengthened dynamic pitch. A substantial amount of variability was observed in a group of older listeners. This variability was predominantly predicted by the listeners' age but not by their hearing thresholds or their ability to perceive dynamic pitch in fluctuating noise. CONCLUSIONS The asymmetrical manipulation of dynamic pitch had a less negative effect than the symmetrical manipulation. This effect also interacted with pitch-change direction. These findings suggest the influence of perceptual naturalness on speech recognition with signal modification. Directions for future research are also discussed.
Affiliation(s)
- Jing Shen
- Department of Communication Sciences and Disorders, College of Public Health, Temple University, Philadelphia, PA
- Jingwei Wu
- Department of Epidemiology and Biostatistics, College of Public Health, Temple University, Philadelphia, PA
2. Wasiuk PA, Calandruccio L, Oleson JJ, Buss E. Predicting speech-in-speech recognition: Short-term audibility and spatial separation. Journal of the Acoustical Society of America. 2023;154:1827-1837. PMID: 37728286; DOI: 10.1121/10.0021069.
Abstract
Quantifying the factors that predict variability in speech-in-speech recognition represents a fundamental challenge in auditory science. Stimulus factors associated with energetic and informational masking (IM) modulate variability in speech-in-speech recognition, but energetic effects can be difficult to estimate in spectro-temporally dynamic speech maskers. The current experiment characterized the effects of short-term audibility and differences in target and masker location (or perceived location) on the horizontal plane for sentence recognition in two-talker speech. Thirty young adults with normal hearing (NH) participated. Speech reception thresholds and keyword recognition at a fixed signal-to-noise ratio (SNR) were measured in each spatial condition. Short-term audibility for each keyword was quantified using a glimpsing model. Results revealed that speech-in-speech recognition depended on the proportion of audible glimpses available in the target + masker keyword stimulus in each spatial condition, even across stimuli presented at a fixed global SNR. Short-term audibility requirements were greater for colocated than spatially separated speech-in-speech recognition, and keyword recognition improved more rapidly as a function of increases in target audibility with spatial separation. Results indicate that spatial cues enhance glimpsing efficiency in competing speech for young adults with NH and provide a quantitative framework for estimating IM for speech-in-speech recognition in different spatial configurations.
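The "proportion of audible glimpses" metric central to this abstract can be illustrated with a toy broadband version: count the short-time frames in which the local target-to-masker ratio exceeds a criterion. The 20 ms window and 3 dB criterion below are illustrative assumptions, not the parameters of the study's glimpsing model, which operates on spectro-temporal regions rather than broadband frames.

```python
import numpy as np

def glimpse_proportion(target, masker, sr=16000, win_ms=20, criterion_db=3.0):
    """Proportion of short-time frames in which the target is 'glimpsed',
    i.e., the local target-to-masker ratio exceeds criterion_db.
    A toy, broadband approximation of a glimpsing metric."""
    n = int(sr * win_ms / 1000)
    m = min(len(target), len(masker)) // n
    t = target[: m * n].reshape(m, n)
    k = masker[: m * n].reshape(m, n)
    # per-frame power; small floor avoids division by zero / log(0)
    pt = np.mean(t ** 2, axis=1) + 1e-12
    pk = np.mean(k ** 2, axis=1) + 1e-12
    local_snr = 10 * np.log10(pt / pk)
    return float(np.mean(local_snr > criterion_db))
```

A target well above the masker yields a proportion near 1; swapping the roles drives it toward 0, which is the sense in which the metric tracks short-term audibility rather than global SNR.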
Affiliation(s)
- Peter A Wasiuk
- Department of Communication Disorders, 493 Fitch Street, Southern Connecticut State University, New Haven, Connecticut 06515, USA
- Lauren Calandruccio
- Department of Psychological Sciences, 11635 Euclid Avenue, Case Western Reserve University, Cleveland, Ohio 44106, USA
- Jacob J Oleson
- Department of Biostatistics, 145 North Riverside Drive N300, College of Public Health, University of Iowa, Iowa City, Iowa 52242, USA
- Emily Buss
- Department of Otolaryngology/Head and Neck Surgery, 170 Manning Drive, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599, USA
3. Mesiano PA, Zaar J, Bramsløw L, Relaño-Iborra H, Dau T. The Role of Average Fundamental Frequency Difference on the Intelligibility of Real-Life Competing Sentences. Journal of Speech, Language, and Hearing Research. 2023:1-14. PMID: 37390502; DOI: 10.1044/2023_jslhr-22-00219.
Abstract
PURPOSE The average fundamental frequency separation (∆fo) between two competing voices has been shown to provide an important cue for target-speech intelligibility. However, some of the previous investigations used speech materials with linguistic properties and fo characteristics that may not be typical of realistic acoustic scenarios. This study investigated to what extent the effect of ∆fo generalizes to more real-life speech. METHODS Real-life sentences and a well-controlled method for manipulating the acoustic stimuli were employed. Fifteen young normal-hearing native Danish listeners were tested in a two-competing-voices sentence recognition task at several target-to-masker ratios (TMRs) and ∆fos. RESULTS Compared to previous studies that addressed the same experimental scenario with less realistic speech materials, the present results showed only a moderate effect of ∆fo at negative TMRs and a negligible effect at positive TMRs. An analysis of the employed stimuli showed that a large ∆fo effect on the target speech intelligibility is only observed when the competing sentences have highly synchronous fo trajectories, which is typical of the artificial speech materials employed in previous studies. CONCLUSION Overall, the present results suggest a relatively small effect of ∆fo on the intelligibility of real-life speech, as compared to previously employed artificial speech, in two-competing-sentences conditions.
Affiliation(s)
- Paolo A Mesiano
- Department of Health Technology, Technical University of Denmark, Kongens Lyngby
- Johannes Zaar
- Department of Health Technology, Technical University of Denmark, Kongens Lyngby
- Eriksholm Research Centre, Helsingør, Denmark
- Helia Relaño-Iborra
- Department of Health Technology, Technical University of Denmark, Kongens Lyngby
- Department of Applied Mathematics and Computer Science, Technical University of Denmark, Kongens Lyngby
- Torsten Dau
- Department of Health Technology, Technical University of Denmark, Kongens Lyngby
4. Flaherty MM, Buss E, Libert K. Effects of Target and Masker Fundamental Frequency Contour Depth on School-Age Children's Speech Recognition in a Two-Talker Masker. Journal of Speech, Language, and Hearing Research. 2023;66:400-414. PMID: 36580582; DOI: 10.1044/2022_jslhr-22-00207.
Abstract
PURPOSE Maturation of the ability to recognize target speech in the presence of a two-talker speech masker extends into early adolescence. This study evaluated whether children benefit from differences in fundamental frequency (fo) contour depth between the target and masker speech, a cue that has been shown to improve recognition in adults. METHOD Speech stimuli were recorded from talkers using three speaking styles, with fo contour depths that were Flat, Normal, or Exaggerated. Targets were open-set, declarative sentences produced by a female talker, and maskers were two streams of concatenated sentences produced by a second female talker. Listeners were children (ages 5-17 years) and adults (ages 18-24 years) with normal hearing. Each listener was tested in one of the three masker styles paired with all three target styles. Speech recognition thresholds (SRTs) corresponding to 50% correct were estimated by fitting psychometric functions to adaptive track data. RESULTS For adults, performance did not differ significantly across conditions with matched speaking styles. A mismatch benefit was observed when combining Flat targets with the Exaggerated masker and Exaggerated targets with the Flat masker, and for both Flat and Exaggerated targets paired with the Normal masker. For children, there was a significant effect of age in all conditions. Flat targets in the Flat masker were associated with lower SRTs than the other two matched conditions, and a mismatch benefit was observed for young children only when the target fo contour was less variable than the masker fo contour. CONCLUSIONS Whereas child-directed speech often has exaggerated pitch contours, young children were better able to recognize speech with less variable fo. Age effects were observed in the benefit of mismatched speaking styles for some conditions, which could be related to differences in baseline SRTs rather than differences in segregation abilities.
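The SRT estimation step described in the METHOD section (fitting psychometric functions to adaptive track data and reading off the 50%-correct point) can be sketched with a generic maximum-likelihood logistic fit. The logistic form, grid-search optimizer, and parameter ranges below are illustrative assumptions, not the study's actual fitting procedure.

```python
import numpy as np

def fit_srt(snrs, n_correct, n_trials):
    """Maximum-likelihood fit of a logistic psychometric function
    p(snr) = 1 / (1 + exp(-slope * (snr - srt))),
    returning the SRT (the SNR giving 50% correct). A coarse grid
    search keeps the sketch free of external optimizers."""
    snrs = np.asarray(snrs, float)
    nc = np.asarray(n_correct, float)
    nt = np.asarray(n_trials, float)
    best_srt, best_ll = None, -np.inf
    for srt in np.arange(snrs.min() - 5, snrs.max() + 5, 0.05):
        for slope in np.arange(0.1, 3.0, 0.05):
            p = 1.0 / (1.0 + np.exp(-slope * (snrs - srt)))
            p = np.clip(p, 1e-9, 1 - 1e-9)  # guard the log-likelihood
            ll = np.sum(nc * np.log(p) + (nt - nc) * np.log(1 - p))
            if ll > best_ll:
                best_srt, best_ll = srt, ll
    return best_srt
```

Given trial counts and correct counts at each tested SNR, the fit recovers the SNR at which the function crosses 50%, which is how a single threshold number summarizes an adaptive track.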
Affiliation(s)
- Mary M Flaherty
- Department of Speech and Hearing Science, University of Illinois at Urbana-Champaign, Champaign
- Emily Buss
- Department of Otolaryngology/Head and Neck Surgery, The University of North Carolina at Chapel Hill
- Kelsey Libert
- Department of Speech and Hearing Science, University of Illinois at Urbana-Champaign, Champaign
5. Wasiuk PA, Buss E, Oleson JJ, Calandruccio L. Predicting speech-in-speech recognition: Short-term audibility, talker sex, and listener factors. Journal of the Acoustical Society of America. 2022;152:3010. PMID: 36456289; DOI: 10.1121/10.0015228.
Abstract
Speech-in-speech recognition can be challenging, and listeners vary considerably in their ability to accomplish this complex auditory-cognitive task. Variability in performance can be related to intrinsic listener factors as well as stimulus factors associated with energetic and informational masking. The current experiments characterized the effects of short-term audibility of the target, differences in target and masker talker sex, and intrinsic listener variables on sentence recognition in two-talker speech and speech-shaped noise. Participants were young adults with normal hearing. Each condition included the adaptive measurement of speech reception thresholds, followed by testing at a fixed signal-to-noise ratio (SNR). Short-term audibility for each keyword was quantified using a computational glimpsing model for target+masker mixtures. Scores on a psychophysical task of auditory stream segregation predicted speech recognition, with stronger effects for speech-in-speech than speech-in-noise. Both speech-in-speech and speech-in-noise recognition depended on the proportion of audible glimpses available in the target+masker mixture, even across stimuli presented at the same global SNR. Short-term audibility requirements varied systematically across stimuli, providing an estimate of the greater informational masking for speech-in-speech than speech-in-noise recognition and quantifying informational masking for matched and mismatched talker sex.
Affiliation(s)
- Peter A Wasiuk
- Department of Psychological Sciences, 11635 Euclid Avenue, Case Western Reserve University, Cleveland, Ohio 44106, USA
- Emily Buss
- Department of Otolaryngology/Head and Neck Surgery, 170 Manning Drive, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599, USA
- Jacob J Oleson
- Department of Biostatistics, 145 North Riverside Drive, University of Iowa, Iowa City, Iowa 52242, USA
- Lauren Calandruccio
- Department of Psychological Sciences, 11635 Euclid Avenue, Case Western Reserve University, Cleveland, Ohio 44106, USA
6. Shen J, Fitzgerald LP, Kulick ER. Interactions between acoustic challenges and processing depth in speech perception as measured by task-evoked pupil response. Front Psychol. 2022;13:959638. PMID: 36389464; PMCID: PMC9641013; DOI: 10.3389/fpsyg.2022.959638.
Abstract
Speech perception under adverse conditions is a multistage process involving a dynamic interplay among acoustic, cognitive, and linguistic factors. Nevertheless, prior research has primarily focused on factors within this complex system in isolation. The primary goal of the present study was to examine the interaction between processing depth and the acoustic challenge of noise and its effect on processing effort during speech perception in noise. Two tasks were used to represent different depths of processing. The speech recognition task involved repeating back a sentence after auditory presentation (higher-level processing), while the tiredness judgment task entailed a subjective judgment of whether the speaker sounded tired (lower-level processing). The secondary goal of the study was to investigate whether pupil response to alteration of dynamic pitch cues stems from difficult linguistic processing of speech content in noise or a perceptual novelty effect due to the unnatural pitch contours. Task-evoked peak pupil response from two groups of younger adult participants with typical hearing was measured in two experiments. Both tasks (speech recognition and tiredness judgment) were implemented in both experiments, and stimuli were presented with background noise in Experiment 1 and without noise in Experiment 2. Increased peak pupil dilation was associated with deeper processing (i.e., the speech recognition task), particularly in the presence of background noise. Importantly, there was a non-additive interaction between noise and task, as demonstrated by the heightened peak pupil dilation to noise in the speech recognition task as compared to the tiredness judgment task. Additionally, peak pupil dilation data suggest that dynamic pitch alteration induced an increased perceptual novelty effect rather than reflecting effortful linguistic processing of the speech content in noise. These findings extend current theories of speech perception under adverse conditions by demonstrating that the level of processing effort expended by a listener is influenced by the interaction between acoustic challenges and depth of linguistic processing. The study also provides a foundation for future work to investigate the effects of this complex interaction in clinical populations who experience both hearing and cognitive challenges.
Affiliation(s)
- Jing Shen
- Department of Communication Sciences and Disorders, College of Public Health, Temple University, Philadelphia, PA, United States
- Laura P. Fitzgerald
- Department of Communication Sciences and Disorders, College of Public Health, Temple University, Philadelphia, PA, United States
- Erin R. Kulick
- Department of Epidemiology and Biostatistics, College of Public Health, Temple University, Philadelphia, PA, United States
7. Buss E, Miller MK, Leibold LJ. Maturation of Speech-in-Speech Recognition for Whispered and Voiced Speech. Journal of Speech, Language, and Hearing Research. 2022;65:3117-3128. PMID: 35868232; PMCID: PMC9911131; DOI: 10.1044/2022_jslhr-21-00620.
Abstract
PURPOSE Some speech recognition data suggest that children rely less on voice pitch and harmonicity to support auditory scene analysis than adults. Two experiments evaluated development of speech-in-speech recognition using voiced speech and whispered speech, which lacks the harmonic structure of voiced speech. METHOD Listeners were 5- to 7-year-olds and adults with normal hearing. Targets were monosyllabic words organized into three-word sets that differ in vowel content. Maskers were two-talker or one-talker streams of speech. Targets and maskers were recorded by different female talkers in both voiced and whispered speaking styles. For each masker, speech reception thresholds (SRTs) were measured in all four combinations of target and masker speech, including matched and mismatched speaking styles for the target and masker. RESULTS Children performed more poorly than adults overall. For the two-talker masker, this age effect was smaller for the whispered target and masker than for the other three conditions. Children's SRTs in this condition were predominantly positive, suggesting that they may have relied on a holistic listening strategy rather than segregating the target from the masker. For the one-talker masker, age effects were consistent across the four conditions. Reduced informational masking for the one-talker masker could be responsible for differences in age effects for the two maskers. A benefit of mismatching the target and masker speaking style was observed for both target styles in the two-talker masker and for the voiced targets in the one-talker masker. CONCLUSIONS These results provide no compelling evidence that young school-age children and adults are differentially sensitive to the cues present in voiced and whispered speech. Both groups benefit from mismatches in speaking style under some conditions. These benefits could be due to a combination of reduced perceptual similarity, harmonic cancellation, and differences in energetic masking.
Affiliation(s)
- Emily Buss
- Department of Otolaryngology-Head and Neck Surgery, University of North Carolina at Chapel Hill
- Margaret K. Miller
- Center for Hearing Research, Boys Town National Research Hospital, Omaha, NE
- Lori J. Leibold
- Center for Hearing Research, Boys Town National Research Hospital, Omaha, NE
8. Brown VA, Dillman-Hasso NH, Li Z, Ray L, Mamantov E, Van Engen KJ, Strand JF. Revisiting the target-masker linguistic similarity hypothesis. Atten Percept Psychophys. 2022;84:1772-1787. PMID: 35474415; PMCID: PMC10701341; DOI: 10.3758/s13414-022-02486-3.
Abstract
The linguistic similarity hypothesis states that it is more difficult to segregate target and masker speech when they are linguistically similar. For example, recognition of English target speech should be more impaired by the presence of Dutch masking speech than Mandarin masking speech because Dutch and English are more linguistically similar than Mandarin and English. Across four experiments, English target speech was consistently recognized more poorly when presented in English masking speech than in silence, speech-shaped noise, or an unintelligible masker (i.e., Dutch or Mandarin). However, we found no evidence for graded masking effects: Dutch did not impair performance more than Mandarin in any experiment, despite 650 participants being tested. This general pattern was consistent when using both a cross-modal paradigm (in which target speech was lipread and maskers were presented aurally; Experiments 1a and 1b) and an auditory-only paradigm (in which both the targets and maskers were presented aurally; Experiments 2a and 2b). These findings suggest that the linguistic similarity hypothesis should be refined to reflect the existing evidence: There is greater release from masking when the masker language differs from the target speech than when it is the same as the target speech. However, evidence that unintelligible maskers impair speech identification to a greater extent when they are more linguistically similar to the target language remains elusive.
Affiliation(s)
- Violet A Brown
- Department of Psychological and Brain Sciences, Washington University in St. Louis, One Brookings Drive, St. Louis, MO, 63130, USA.
- Naseem H Dillman-Hasso
- Carleton College, Department of Psychology, One North College St, Northfield, MN, 55057, USA
- ZhaoBin Li
- Carleton College, Department of Psychology, One North College St, Northfield, MN, 55057, USA
- Lucia Ray
- Carleton College, Department of Psychology, One North College St, Northfield, MN, 55057, USA
- Ellen Mamantov
- Carleton College, Department of Psychology, One North College St, Northfield, MN, 55057, USA
- Kristin J Van Engen
- Department of Psychological and Brain Sciences, Washington University in St. Louis, One Brookings Drive, St. Louis, MO, 63130, USA
- Julia F Strand
- Carleton College, Department of Psychology, One North College St, Northfield, MN, 55057, USA
- Carleton College, Department of Psychology, One North College St, Northfield, MN, 55057, USA
9. Analysis Model of Spoken English Evaluation Algorithm Based on Intelligent Algorithm of Internet of Things. Computational Intelligence and Neuroscience. 2022;2022:8469945. PMID: 35387241; PMCID: PMC8977287; DOI: 10.1155/2022/8469945.
Abstract
With the deepening integration of artificial intelligence into industry, speech recognition has become an important medium for human-computer interaction and has attracted extensive research attention in both industry and academia. However, existing high-accuracy speech recognition products rely on massive data platforms, which respond slowly and carry security risks; they therefore struggle to meet the needs of applications that require timely speech translation under unstable or insecure network conditions. Against this background, this paper develops an analysis model for evaluating spoken English based on an Internet of Things (IoT) intelligent algorithm applied to speech recognition. First, drawing on automated machine learning and lightweight learning strategies, a lightweight deep neural network for automatic speech recognition is proposed that is matched to the computing power available at the network edge. Second, the IoT intelligent classification algorithm and the big data analysis used in the system are evaluated quantitatively, using a set of oral English features to score spoken-English accuracy. Finally, experimental results show that the resulting oral English feature recognition system is reliable, highly intelligent, and robust to subjective factors, demonstrating the advantages of combining IoT intelligent classification algorithms with big data analysis for English feature recognition.
10.
Abstract
Identification of speech from a "target" talker was measured in a speech-on-speech masking task with two simultaneous "masker" talkers. The overall level of each talker was either fixed or randomized throughout each stimulus presentation to investigate the effectiveness of level as a cue for segregating competing talkers and attending to the target. Experimental manipulations included varying the level difference between talkers and imposing three types of target level uncertainty: 1) fixed target level across trials, 2) random target level across trials, or 3) random target levels on a word-by-word basis within a trial. When the target level was predictable, performance was better than in corresponding conditions in which the target level was uncertain. Masker confusions were consistent with a high degree of informational masking (IM). Furthermore, evidence was found for "tuning" in level and a level "release" from IM. These findings suggest that conforming to listener expectations about relative level, in addition to cues signaling talker identity, facilitates segregation of, and maintenance of attention on, a specific talker in multiple-talker communication situations.
Affiliation(s)
- Andrew J Byrne
- Department of Speech, Language, & Hearing Sciences, Boston University, MA, USA
- Christopher Conroy
- Department of Speech, Language, & Hearing Sciences, Boston University, MA, USA
- Gerald Kidd
- Department of Speech, Language, & Hearing Sciences, Boston University, MA, USA
- Department of Otolaryngology, Head-Neck Surgery, Medical University of South Carolina, Charleston, SC, USA
11. Shen J. Pupillary response to dynamic pitch alteration during speech perception in noise. JASA Express Letters. 2021;1:115202. PMID: 34778875; PMCID: PMC8574131; DOI: 10.1121/10.0007056.
Abstract
Dynamic pitch, also known as intonation, conveys both semantic and pragmatic meaning in speech communication. While alteration of this cue is detrimental to speech intelligibility in noise, the mechanism involved is poorly understood. Using the psychophysiological measure of task-evoked pupillary response, this study examined the perceptual effect of altered dynamic pitch cues on speech perception in noise. The data showed that pupil dilation increased with dynamic pitch strength in a sentence recognition in noise task. Taken together with recognition accuracy data, the results suggest the involvement of perceptual arousal in speech perception with dynamic pitch alteration.
Affiliation(s)
- Jing Shen
- Department of Communication Sciences and Disorders, Temple University, 1701 North 13th Street, Philadelphia, Pennsylvania 19122, USA
12. Buss E, Bosen A. Band importance for speech-in-speech recognition. JASA Express Letters. 2021;1:084402. PMID: 34661194; PMCID: PMC8499852; DOI: 10.1121/10.0005762.
Abstract
Predicting masked speech perception typically relies on estimates of the spectral distribution of cues supporting recognition. Current methods for estimating band importance for speech-in-noise use filtered stimuli. These methods are not appropriate for speech-in-speech because filtering can modify stimulus features affecting auditory stream segregation. Here, band importance is estimated by quantifying the relationship between speech recognition accuracy for full-spectrum speech and the target-to-masker ratio by channel at the output of an auditory filterbank. Preliminary results provide support for this approach and indicate that frequencies below 2 kHz may contribute more to speech recognition in two-talker speech than in speech-shaped noise.
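The band-level analysis described here, computing a target-to-masker ratio per channel at the output of an auditory filterbank, can be sketched with a simple FFT-based band split; the fixed band edges below stand in for an auditory filterbank and are an illustrative assumption, not the authors' implementation.

```python
import numpy as np

def band_tmr(target, masker, sr=16000, edges=(100, 500, 1000, 2000, 4000)):
    """Target-to-masker ratio (dB) in each frequency band, computed from
    FFT power spectra. Adjacent values in `edges` delimit the bands; an
    FFT band split stands in for an auditory filterbank here."""
    n = min(len(target), len(masker))
    freqs = np.fft.rfftfreq(n, 1.0 / sr)
    pt = np.abs(np.fft.rfft(target[:n])) ** 2
    pm = np.abs(np.fft.rfft(masker[:n])) ** 2
    tmr = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        band = (freqs >= lo) & (freqs < hi)
        # small floor avoids log(0) for bands with negligible energy
        tmr.append(10 * np.log10((pt[band].sum() + 1e-12) /
                                 (pm[band].sum() + 1e-12)))
    return np.array(tmr)
```

Relating such per-channel TMRs to recognition accuracy for full-spectrum speech is the core of the approach: the stimulus is never filtered, so stream-segregation cues stay intact while the analysis still resolves frequency.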
Affiliation(s)
- Emily Buss
- Department of Otolaryngology/Head and Neck Surgery, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599, USA
- Adam Bosen
- Center for Hearing Research, Boys Town National Research Hospital, Omaha, Nebraska 68131, USA
13. Liu JS, Liu YW, Yu YF, Galvin JJ, Fu QJ, Tao DD. Segregation of competing speech in adults and children with normal hearing and in children with cochlear implants. Journal of the Acoustical Society of America. 2021;150:339. PMID: 34340485; DOI: 10.1121/10.0005597.
Abstract
Children with normal hearing (CNH) have greater difficulty segregating competing speech than do adults with normal hearing (ANH). Children with cochlear implants (CCI) have greater difficulty segregating competing speech than do CNH. In the present study, speech reception thresholds (SRTs) in competing speech were measured in Chinese Mandarin-speaking ANH, CNH, and CCIs. Target sentences were produced by a male Mandarin-speaking talker. Maskers were time-forward or -reversed sentences produced by a native Mandarin-speaking male (different from the target) or female or a non-native English-speaking male. The SRTs were lowest (best) for the ANH group, followed by the CNH and CCI groups. The masking release (MR) was comparable between the ANH and CNH group, but much poorer in the CCI group. The temporal properties differed between the native and non-native maskers and between forward and reversed speech. The temporal properties of the maskers were significantly associated with the SRTs for the CCI and CNH groups but not for the ANH group. Whereas the temporal properties of the maskers were significantly associated with the MR for all three groups, the association was stronger for the CCI and CNH groups than for the ANH group.
Affiliation(s)
- Ji-Sheng Liu
- Department of Ear, Nose, and Throat, The First Affiliated Hospital of Soochow University, Suzhou 215006, China
- Yang-Wenyi Liu
- Department of Otology and Skull Base Surgery, Eye Ear Nose and Throat Hospital, Fudan University, Shanghai 200031, China
- Ya-Feng Yu
- Department of Ear, Nose, and Throat, The First Affiliated Hospital of Soochow University, Suzhou 215006, China
- John J Galvin
- House Ear Institute, Los Angeles, California 90057, USA
- Qian-Jie Fu
- Department of Head and Neck Surgery, David Geffen School of Medicine, University of California Los Angeles (UCLA), Los Angeles, California 90095, USA
- Duo-Duo Tao
- Department of Ear, Nose, and Throat, The First Affiliated Hospital of Soochow University, Suzhou 215006, China
- Department of Ear, Nose, and Throat, The First Affiliated Hospital of Soochow University, Suzhou 215006, China
14. Jett B, Buss E, Best V, Oleson J, Calandruccio L. Does Sentence-Level Coarticulation Affect Speech Recognition in Noise or a Speech Masker? Journal of Speech, Language, and Hearing Research. 2021;64:1390-1403. PMID: 33784185; PMCID: PMC8608179; DOI: 10.1044/2021_jslhr-20-00450.
Abstract
Purpose Three experiments were conducted to better understand the role of between-word coarticulation in masked speech recognition. Specifically, we explored whether naturally coarticulated sentences supported better masked speech recognition as compared to sentences derived from individually spoken concatenated words. We hypothesized that sentence recognition thresholds (SRTs) would be similar for coarticulated and concatenated sentences in a noise masker but would be better for coarticulated sentences in a speech masker. Method Sixty young adults participated (n = 20 per experiment). An adaptive tracking procedure was used to estimate SRTs in the presence of noise or two-talker speech maskers. Targets in Experiments 1 and 2 were matrix-style sentences, while targets in Experiment 3 were semantically meaningful sentences. All experiments included coarticulated and concatenated targets; Experiments 2 and 3 included a third target type, concatenated keyword-intensity-matched (KIM) sentences, in which the words were concatenated but individually scaled to replicate the intensity contours of the coarticulated sentences. Results Regression analyses evaluated the main effects of target type, masker type, and their interaction. Across all three experiments, effects of target type were small (< 2 dB). In Experiment 1, SRTs were slightly poorer for coarticulated than concatenated sentences. In Experiment 2, coarticulation facilitated speech recognition compared to the concatenated KIM condition. When listeners had access to semantic context (Experiment 3), a coarticulation benefit was observed in noise but not in the speech masker. Conclusions Overall, differences between SRTs for sentences with and without between-word coarticulation were small. Beneficial effects of coarticulation were only observed relative to the concatenated KIM targets; for unscaled concatenated targets, it appeared that consistent audibility across the sentence offset any benefit of coarticulation. Contrary to our hypothesis, effects of coarticulation generally were not more pronounced in speech maskers than in noise maskers.
Affiliation(s)
- Brandi Jett
- Department of Psychological Sciences, Case Western Reserve University, Cleveland, OH
- Emily Buss
- Department of Otolaryngology/Head and Neck Surgery, University of North Carolina at Chapel Hill
- Virginia Best
- Department of Speech, Language and Hearing Sciences, Boston University, MA
- Jacob Oleson
- Department of Biostatistics, University of Iowa, Iowa City
- Lauren Calandruccio
- Department of Psychological Sciences, Case Western Reserve University, Cleveland, OH
15
Shen J. Older Listeners' Perception of Speech With Strengthened and Weakened Dynamic Pitch Cues in Background Noise. JOURNAL OF SPEECH, LANGUAGE, AND HEARING RESEARCH : JSLHR 2021; 64:348-358. [PMID: 33439741 PMCID: PMC8632513 DOI: 10.1044/2020_jslhr-20-00116] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 03/15/2020] [Revised: 07/28/2020] [Accepted: 09/21/2020] [Indexed: 06/12/2023]
Abstract
Purpose Dynamic pitch, which is defined as the variation in fundamental frequency, is an acoustic cue that aids speech perception in noise. This study examined the effects of strengthened and weakened dynamic pitch cues on older listeners' speech perception in noise, as well as how these effects were modulated by individual factors including spectral perception ability. Method The experiment measured speech reception thresholds in noise in both younger listeners with normal hearing and older listeners whose hearing status ranged from near-normal hearing to mild-to-moderate sensorineural hearing loss. The pitch contours of the target speech were manipulated to create four levels of dynamic pitch strength: weakened, original, mildly strengthened, and strengthened. Listeners' spectral perception ability was measured using tests of spectral ripple and frequency modulation discrimination. Results Both younger and older listeners performed worse with manipulated dynamic pitch cues than with original dynamic pitch. The effects of dynamic pitch on older listeners' speech recognition were associated with their age but not with their perception of spectral information. Those older listeners who were relatively younger were more negatively affected by dynamic pitch manipulations. Conclusions The findings suggest that the current pitch manipulation strategy is detrimental to older listeners' perception of speech in noise, as compared to original dynamic pitch. While the influence of age on the effects of dynamic pitch is likely due to age-related declines in pitch perception, the spectral measures used in this study were not strong predictors of dynamic pitch effects. Taken together, these results indicate that next steps in this line of work should focus on how to manipulate acoustic cues in speech in order to improve speech perception in noise for older listeners.
Affiliation(s)
- Jing Shen
- Department of Speech, Language and Hearing Sciences, Western Michigan University, Kalamazoo
16
Flaherty MM, Buss E, Leibold LJ. Independent and Combined Effects of Fundamental Frequency and Vocal Tract Length Differences for School-Age Children's Sentence Recognition in a Two-Talker Masker. JOURNAL OF SPEECH, LANGUAGE, AND HEARING RESEARCH : JSLHR 2021; 64:206-217. [PMID: 33375828 PMCID: PMC8610228 DOI: 10.1044/2020_jslhr-20-00327] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 06/08/2020] [Revised: 09/08/2020] [Accepted: 09/29/2020] [Indexed: 06/12/2023]
Abstract
Purpose The purpose of this study was to examine the independent and combined contributions of fundamental frequency (F0) and vocal tract length (VTL) differences on children's speech-in-speech recognition in the presence of a competing two-talker masker. Method Participants were 64 children (5-17 years old) and 25 adults (18-39 years old). Sentence recognition thresholds were measured in a two-talker masker. Target sentences had either the same mean F0 and VTL of the masker or were digitally altered so that the target and masker differed in F0 (Experiment 1), differed in VTL (Experiment 2), or differed in both F0 and VTL (Experiment 3). To determine the benefit, masking release was computed by subtracting thresholds in each shifted condition from the threshold in the unshifted condition. Results Results demonstrate that children's ability to benefit from either F0 or VTL differences (Experiments 1 and 2) depended on listener age, with younger children showing less improvement in speech reception thresholds compared to older children and adults. Age effects were also evident in the combined-cue conditions (Experiment 3), but children showed greater improvements compared to F0-only or VTL-only manipulations. Conclusions There was a prolonged pattern of development in children's ability to benefit from F0 or VTL differences between target and masker speech. Young children failed to capitalize on F0 and VTL differences to the same extent as older children and adults but did show a robust benefit when the cues were combined, supporting the hypothesis that younger children rely more heavily on redundant cues compared to older children and adults.
Affiliation(s)
- Mary M. Flaherty
- Department of Speech and Hearing Science, University of Illinois at Urbana-Champaign
- Emily Buss
- Department of Otolaryngology/Head and Neck Surgery, School of Medicine, University of North Carolina at Chapel Hill
- Lori J. Leibold
- Center for Hearing Research, Boys Town National Research Hospital, Omaha, NE
17
Wasiuk PA, Lavandier M, Buss E, Oleson J, Calandruccio L. The effect of fundamental frequency contour similarity on multi-talker listening in older and younger adults. THE JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA 2020; 148:3527. [PMID: 33379934 PMCID: PMC7863686 DOI: 10.1121/10.0002661] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 05/04/2023]
Abstract
Older adults with hearing loss have greater difficulty recognizing target speech in multi-talker environments than young adults with normal hearing, especially when target and masker speech streams are perceptually similar. A difference in fundamental frequency (f0) contour depth is an effective stream segregation cue for young adults with normal hearing. This study examined whether older adults with varying degrees of sensorineural hearing loss are able to utilize differences in target/masker f0 contour depth to improve speech recognition in multi-talker listening. Speech recognition thresholds (SRTs) were measured for speech mixtures composed of target/masker streams with flat, normal, and exaggerated speaking styles, in which f0 contour depth systematically varied. Computational modeling estimated differences in energetic masking across listening conditions. Young adults had lower SRTs than older adults, a result that was partially explained by differences in audibility predicted by the model. However, audibility differences did not explain why young adults experienced a benefit from mismatched target/masker f0 contour depth, while in most conditions, older adults did not. A reduced ability to use segregation cues (differences in target/masker f0 contour depth) and deficits in grouping speech with variable f0 contours likely contribute to the difficulties experienced by older adults in challenging acoustic environments.
Affiliation(s)
- Peter A Wasiuk
- Department of Psychological Sciences, 11635 Euclid Avenue, Case Western Reserve University, Cleveland, Ohio 44106, USA
- Mathieu Lavandier
- Univ. Lyon, ENTPE, Laboratoire Génie Civil et Bâtiment, Rue M. Audin, Vaulx-en-Velin Cedex, 69518, France
- Emily Buss
- Department of Otolaryngology/Head and Neck Surgery, University of North Carolina, CB#7070, Chapel Hill, North Carolina 27599, USA
- Jacob Oleson
- Department of Biostatistics, N300 CPHB, University of Iowa, 145 North Riverside Drive, Iowa City, Iowa 52242-2007, USA
- Lauren Calandruccio
- Department of Psychological Sciences, 11635 Euclid Avenue, Case Western Reserve University, Cleveland, Ohio 44106, USA
18
Bonino AY, Malley AR. Measuring open-set, word recognition in school-aged children: Corpus of monosyllabic target words and speech maskers. THE JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA 2019; 146:EL393. [PMID: 31671998 PMCID: PMC6910017 DOI: 10.1121/1.5130192] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 08/17/2019] [Revised: 09/30/2019] [Accepted: 09/30/2019] [Indexed: 06/10/2023]
Abstract
A corpus of stimuli has been collected to support the use of common materials across research laboratories to examine school-aged children's word recognition in speech maskers. The corpus includes (1) 773 monosyllabic words that are known to be in the lexicon of 5- and 6-year-olds and (2) seven masker passages that are based on a first-grade child's writing samples. Materials were recorded by a total of 13 talkers (8 women; 5 men). All talkers recorded two masker passages; 3 talkers (2 women; 1 man) also recorded the target words. The annotated corpus is freely available online for research purposes.
Affiliation(s)
- Angela Yarnell Bonino
- Department of Speech, Language, and Hearing Sciences, University of Colorado Boulder, Boulder, Colorado 80309
- Ashley R Malley
- Department of Speech, Language, and Hearing Sciences, University of Colorado Boulder, Boulder, Colorado 80309