1
Johnson EM, Healy EW. An ideal compressed mask for increasing speech intelligibility without sacrificing environmental sound recognition. J Acoust Soc Am 2024; 156:3958-3969. [PMID: 39666959] [PMCID: PMC11646135] [DOI: 10.1121/10.0034599]
Abstract
Hearing impairment is often characterized by poor speech-in-noise recognition. State-of-the-art laboratory-based noise-reduction technology can eliminate background sounds from a corrupted speech signal and improve intelligibility, but it can also hinder environmental sound recognition (ESR), which is essential for personal independence and safety. This paper presents a time-frequency mask, the ideal compressed mask (ICM), that aims to provide listeners with improved speech intelligibility without substantially reducing ESR. This is accomplished by limiting the maximum attenuation that the mask performs. Speech intelligibility and ESR for hearing-impaired and normal-hearing listeners were measured using stimuli that had been processed by ICMs with various levels of maximum attenuation. This processing resulted in significantly improved intelligibility while retaining high ESR performance for both types of listeners. It was also found that the same level of maximum attenuation provided the optimal balance of intelligibility and ESR for both listener types. It is argued that future deep-learning-based noise reduction algorithms may provide better outcomes by balancing the levels of the target speech and the background environmental sounds, rather than eliminating all signals except for the target speech. The ICM provides one such simple solution for frequency-domain models.
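A minimal sketch of the idea, assuming an ideal-ratio-style mask whose attenuation is floored at a fixed limit; the function name, mask form, and 20 dB default are illustrative and not the paper's exact definition:

```python
import numpy as np

def ideal_compressed_mask(speech_mag, noise_mag, max_atten_db=20.0):
    """Sketch of a compressed time-frequency mask (assumed form).

    An ideal-ratio-style mask is floored so that no time-frequency unit is
    attenuated by more than max_atten_db, leaving a residual level of the
    background (environmental) sound audible.
    """
    irm = speech_mag**2 / (speech_mag**2 + noise_mag**2 + 1e-12)
    floor = 10.0 ** (-max_atten_db / 20.0)  # linear gain at the attenuation limit
    return np.maximum(irm, floor)
```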
Affiliation(s)
- Eric M Johnson
- Department of Speech and Hearing Science, and Center for Cognitive and Brain Sciences, The Ohio State University, Columbus, Ohio 43210, USA
- Eric W Healy
- Department of Speech and Hearing Science, and Center for Cognitive and Brain Sciences, The Ohio State University, Columbus, Ohio 43210, USA
2
Henry F, Glavin M, Jones E, Parsi A. Impact of Mask Type as Training Target for Speech Intelligibility and Quality in Cochlear-Implant Noise Reduction. Sensors (Basel) 2024; 24:6614. [PMID: 39460094] [PMCID: PMC11511210] [DOI: 10.3390/s24206614]
Abstract
The selection of a target when training deep neural networks for speech enhancement is an important consideration. Different masks have been shown to exhibit different performance characteristics depending on the application and the conditions. This paper presents a comprehensive comparison of several different masks for noise reduction in cochlear implants. The study incorporated three well-known masks, namely the Ideal Binary Mask (IBM), Ideal Ratio Mask (IRM) and the Fast Fourier Transform Mask (FFTM), as well as two newly proposed masks, based on existing masks, called the Quantized Mask (QM) and the Phase-Sensitive plus Ideal Ratio Mask (PSM+). These five masks are used to train networks to estimate masks for the purpose of separating speech from noisy mixtures. A vocoder was used to simulate the behavior of a cochlear implant. Short-time Objective Intelligibility (STOI) and Perceptual Evaluation of Speech Quality (PESQ) scores indicate that the two new masks proposed in this study (QM and PSM+) perform best for normal speech intelligibility and quality in the presence of stationary and non-stationary noise over a range of signal-to-noise ratios (SNRs). The Normalized Covariance Measure (NCM) and similarity scores indicate that they also perform best for speech intelligibility/gauging the similarity of vocoded speech. The Quantized Mask performs better than the Ideal Binary Mask due to its better resolution as it approximates the Wiener Gain Function. The PSM+ performs better than the three existing benchmark masks (IBM, IRM, and FFTM) as it incorporates both magnitude and phase information.
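For reference, the three benchmark masks named above have standard textbook definitions, sketched per time-frequency unit below; the local-criterion value and the square-root form of the IRM follow common usage rather than this paper's specific settings.

```python
import numpy as np

def ibm(snr_db, local_criterion_db=0.0):
    """Ideal Binary Mask: 1 where the local SNR exceeds a criterion, else 0."""
    return (np.asarray(snr_db) > local_criterion_db).astype(float)

def irm(speech_pow, noise_pow):
    """Ideal Ratio Mask: speech-to-mixture energy ratio in each T-F unit."""
    return np.sqrt(speech_pow / (speech_pow + noise_pow + 1e-12))

def psm(speech_stft, mixture_stft):
    """Phase-Sensitive Mask: magnitude ratio weighted by the phase difference."""
    return (np.abs(speech_stft) / (np.abs(mixture_stft) + 1e-12)) * \
           np.cos(np.angle(speech_stft) - np.angle(mixture_stft))
```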
Affiliation(s)
- Fergal Henry
- Department of Computing and Electronic Engineering, Atlantic Technological University, Ash Lane, F91 YW50 Sligo, Ireland
- Martin Glavin
- Electrical and Electronic Engineering, University of Galway, University Road, H91 TK33 Galway, Ireland
- Edward Jones
- Electrical and Electronic Engineering, University of Galway, University Road, H91 TK33 Galway, Ireland
- Ashkan Parsi
- Electrical and Electronic Engineering, University of Galway, University Road, H91 TK33 Galway, Ireland
3
Brice S, Zakis J, Almond H. Changing Knowledge, Principles, and Technology in Contemporary Clinical Audiological Practice: A Narrative Review. J Clin Med 2024; 13:4538. [PMID: 39124804] [PMCID: PMC11313557] [DOI: 10.3390/jcm13154538]
Abstract
The field of audiology as a collection of auditory science knowledge, research, and clinical methods, technologies, and practices has seen great changes. A deeper understanding of psychological, cognitive, and behavioural interactions has led to a growing range of variables of interest to measure and track in diagnostic and rehabilitative processes. Technology-led changes to clinical practices, including teleaudiology, have heralded a call to action in order to recognise the role and impact of autonomy and agency on clinical practice, engagement, and outcomes. Advances in and new information on loudness models, tinnitus, psychoacoustics, deep neural networks, machine learning, predictive and adaptive algorithms, and PREMs/PROMs have enabled innovations in technology to revolutionise clinical principles and practices for the following: (i) assessment, (ii) fitting and programming of hearing devices, and (iii) rehabilitation. This narrative review will consider how the rise of teleaudiology as a growing and increasingly fundamental element of contemporary adult audiological practice has affected the principles and practices of audiology based on a new era of knowledge and capability. What areas of knowledge have grown? How has new knowledge shifted the priorities in clinical audiology? What technological innovations have been combined with these to change clinical practices? Above all, where is hearing loss now consequently positioned in its journey as a field of health and medicine?
Affiliation(s)
- Sophie Brice
- Australian Institute of Health Service Management, COBE, University of Tasmania, Sandy Bay, Hobart, TAS 7001, Australia
- Institute of Health Management, 185-187 Boundary Road, North Melbourne, VIC 3051, Australia
- Justin Zakis
- National Acoustic Laboratories, Level 4, 16 University Avenue, Macquarie University, NSW 2109, Australia
- Helen Almond
- Institute of Health Management, 185-187 Boundary Road, North Melbourne, VIC 3051, Australia
4
Gaultier C, Goehring T. Recovering speech intelligibility with deep learning and multiple microphones in noisy-reverberant situations for people using cochlear implants. J Acoust Soc Am 2024; 155:3833-3847. [PMID: 38884525] [DOI: 10.1121/10.0026218]
Abstract
For cochlear implant (CI) listeners, holding a conversation in noisy and reverberant environments is often challenging. Deep-learning algorithms can potentially mitigate these difficulties by enhancing speech in everyday listening environments. This study compared several deep-learning algorithms with access to one, two unilateral, or six bilateral microphones that were trained to recover speech signals by jointly removing noise and reverberation. The noisy-reverberant speech and an ideal noise reduction algorithm served as lower and upper references, respectively. Objective signal metrics were compared with results from two listening tests, including 15 typical hearing listeners with CI simulations and 12 CI listeners. Large and statistically significant improvements in speech reception thresholds of 7.4 and 10.3 dB were found for the multi-microphone algorithms. For the single-microphone algorithm, there was an improvement of 2.3 dB but only for the CI listener group. The objective signal metrics correctly predicted the rank order of results for CI listeners, and there was an overall agreement for most effects and variances between results for CI simulations and CI listeners. These algorithms hold promise to improve speech intelligibility for CI listeners in environments with noise and reverberation and benefit from a boost in performance when using features extracted from multiple microphones.
Affiliation(s)
- Clément Gaultier
- Cambridge Hearing Group, Medical Research Council Cognition and Brain Sciences Unit, University of Cambridge, Cambridge, CB2 7EF, United Kingdom
- Tobias Goehring
- Cambridge Hearing Group, Medical Research Council Cognition and Brain Sciences Unit, University of Cambridge, Cambridge, CB2 7EF, United Kingdom
5
Fan J, Williamson DS. From the perspective of perceptual speech quality: The robustness of frequency bands to noise. J Acoust Soc Am 2024; 155:1916-1927. [PMID: 38456734] [DOI: 10.1121/10.0025272]
Abstract
Speech quality is one of the main foci of speech-related research, where it is frequently studied with speech intelligibility, another essential measurement. Band-level perceptual speech intelligibility, however, has been studied frequently, whereas speech quality has not been thoroughly analyzed. In this paper, a Multiple Stimuli With Hidden Reference and Anchor (MUSHRA) inspired approach was proposed to study the individual robustness of frequency bands to noise with perceptual speech quality as the measure. Speech signals were filtered into thirty-two frequency bands with compromising real-world noise employed at different signal-to-noise ratios. Robustness to noise indices of individual frequency bands was calculated based on the human-rated perceptual quality scores assigned to the reconstructed noisy speech signals. Trends in the results suggest the mid-frequency region appeared less robust to noise in terms of perceptual speech quality. These findings suggest future research aiming at improving speech quality should pay more attention to the mid-frequency region of the speech signals accordingly.
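A generic sketch of the mixing step described above, scaling noise so the mixture reaches a target SNR; the 32-band analysis and the listening-test procedure are not reproduced here.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so that the speech-to-noise power ratio equals snr_db,
    then return the noisy mixture (speech and noise are equal-length arrays)."""
    speech_pow = np.mean(speech**2)
    noise_pow = np.mean(noise**2)
    gain = np.sqrt(speech_pow / (noise_pow * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise
```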
Affiliation(s)
- Junyi Fan
- Department of Computer Science and Engineering, The Ohio State University, Columbus, Ohio 43210, USA
- Donald S Williamson
- Department of Computer Science and Engineering, The Ohio State University, Columbus, Ohio 43210, USA
6
Sabin AT, McElhone D, Gauger D, Rabinowitz B. Modeling the Intelligibility Benefit of Active Noise Cancelation in Hearing Devices That Improve Signal-to-Noise Ratio. Trends Hear 2024; 28:23312165241260029. [PMID: 38831646] [PMCID: PMC11149449] [DOI: 10.1177/23312165241260029]
Abstract
The extent to which active noise cancelation (ANC), when combined with hearing assistance, can improve speech intelligibility in noise is not well understood. One possible source of benefit is ANC's ability to reduce the sound level of the direct (i.e., vent-transmitted) path. This reduction lowers the "floor" imposed by the direct path, thereby allowing any increases to the signal-to-noise ratio (SNR) created in the amplified path to be "realized" at the eardrum. Here we used a modeling approach to estimate this benefit. We compared pairs of simulated hearing aids that differ only in terms of their ability to provide ANC and computed intelligibility metrics on their outputs. The difference in metric scores between simulated devices is termed the "ANC Benefit." These simulations show that ANC Benefit increases as (1) the environmental sound level increases, (2) the ability of the hearing aid to improve SNR increases, (3) the strength of the ANC increases, and (4) the hearing loss severity decreases. The predicted size of the ANC Benefit can be substantial. For a moderate hearing loss, the model predicts improvement in intelligibility metrics of >30% when environments are moderately loud (>70 dB SPL) and devices are moderately capable of increasing SNR (by >4 dB). It appears that ANC can be a critical ingredient in hearing devices that attempt to improve SNR in loud environments. ANC will become more and more important as advanced SNR-improving algorithms (e.g., artificial intelligence speech enhancement) are included in hearing devices.
7
Gutz SE, Maffei MF, Green JR. Feedback From Automatic Speech Recognition to Elicit Clear Speech in Healthy Speakers. Am J Speech Lang Pathol 2023; 32:2940-2959. [PMID: 37824377] [PMCID: PMC10721250] [DOI: 10.1044/2023_ajslp-23-00030]
Abstract
PURPOSE This study assessed the effectiveness of feedback generated by automatic speech recognition (ASR) for eliciting clear speech from young, healthy individuals. As a preliminary step toward exploring a novel method for eliciting clear speech in patients with dysarthria, we investigated the effects of ASR feedback in healthy controls. If successful, ASR feedback has the potential to facilitate independent, at-home clear speech practice. METHOD Twenty-three healthy control speakers (ages 23-40 years) read sentences aloud in three speaking modes: Habitual, Clear (over-enunciated), and in response to ASR feedback (ASR). In the ASR condition, we used Mozilla DeepSpeech to transcribe speech samples and provide participants with a value indicating the accuracy of the ASR's transcription. For speakers who achieved sufficiently high ASR accuracy, noise was added to their speech at a participant-specific signal-to-noise ratio to ensure that each participant had to over-enunciate to achieve high ASR accuracy. RESULTS Compared to habitual speech, speech produced in the ASR and Clear conditions was clearer, as rated by speech-language pathologists, and more intelligible, per speech-language pathologist transcriptions. Speech in the Clear and ASR conditions aligned on several acoustic measures, particularly those associated with increased vowel distinctiveness and decreased speaking rate. However, ASR accuracy, intelligibility, and clarity were each correlated with different speech features, which may have implications for how people change their speech for ASR feedback. CONCLUSIONS ASR successfully elicited outcomes similar to clear speech in healthy speakers. Future work should investigate its efficacy in eliciting clear speech in people with dysarthria.
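The accuracy value returned to speakers is a transcription-accuracy score; one common way to compute such a score is word error rate, sketched below. Whether the study used WER or a different metric with DeepSpeech is not stated in the abstract, so treat this as a generic illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Levenshtein distance over words (substitutions + insertions + deletions),
    divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```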
Affiliation(s)
- Sarah E. Gutz
- Department of Communication Sciences and Disorders, MGH Institute of Health Professions, Boston, MA
- Program in Speech and Hearing Bioscience and Technology, Harvard University, Cambridge, MA
- Marc F. Maffei
- Department of Communication Sciences and Disorders, MGH Institute of Health Professions, Boston, MA
- Jordan R. Green
- Department of Communication Sciences and Disorders, MGH Institute of Health Professions, Boston, MA
- Program in Speech and Hearing Bioscience and Technology, Harvard University, Cambridge, MA
8
Henry F, Parsi A, Glavin M, Jones E. Experimental Investigation of Acoustic Features to Optimize Intelligibility in Cochlear Implants. Sensors (Basel) 2023; 23:7553. [PMID: 37688009] [PMCID: PMC10490615] [DOI: 10.3390/s23177553]
Abstract
Although cochlear implants work well for people with hearing impairment in quiet conditions, it is well-known that they are not as effective in noisy environments. Noise reduction algorithms based on machine learning allied with appropriate speech features can be used to address this problem. The purpose of this study is to investigate the importance of acoustic features in such algorithms. Acoustic features are extracted from speech and noise mixtures and used in conjunction with the ideal binary mask to train a deep neural network to estimate masks for speech synthesis to produce enhanced speech. The intelligibility of this speech is objectively measured using metrics such as Short-time Objective Intelligibility (STOI), Hit Rate minus False Alarm Rate (HIT-FA) and Normalized Covariance Measure (NCM) for both simulated normal-hearing and hearing-impaired scenarios. A wide range of existing features is experimentally evaluated, including features that have not been traditionally applied in this application. The results demonstrate that frequency domain features perform best. In particular, Gammatone features performed best for normal hearing over a range of signal-to-noise ratios and noise types (STOI = 0.7826). Mel spectrogram features exhibited the best overall performance for hearing impairment (NCM = 0.7314). There is a stronger correlation between STOI and NCM than HIT-FA and NCM, suggesting that the former is a better predictor of intelligibility for hearing-impaired listeners. The results of this study may be useful in the design of adaptive intelligibility enhancement systems for cochlear implants based on both the noise level and the nature of the noise (stationary or non-stationary).
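As a reminder of the HIT-FA metric named above: it is the hit rate over time-frequency units the ideal binary mask keeps minus the false-alarm rate over units it discards, comparing an estimated binary mask with the ideal one. A minimal sketch with illustrative array names:

```python
import numpy as np

def hit_minus_fa(estimated_mask, ideal_mask):
    """HIT-FA for binary masks: hit rate on units the ideal mask keeps (1s)
    minus false-alarm rate on units it discards (0s)."""
    est = np.asarray(estimated_mask).astype(bool)
    ideal = np.asarray(ideal_mask).astype(bool)
    hit = np.mean(est[ideal]) if ideal.any() else 0.0
    fa = np.mean(est[~ideal]) if (~ideal).any() else 0.0
    return hit - fa
```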
Affiliation(s)
- Fergal Henry
- Department of Computing and Electronic Engineering, Atlantic Technological University Sligo, Ash Lane, F91 YW50 Sligo, Ireland
- Ashkan Parsi
- Electrical and Electronic Engineering, University of Galway, University Road, H91 TK33 Galway, Ireland
- Martin Glavin
- Electrical and Electronic Engineering, University of Galway, University Road, H91 TK33 Galway, Ireland
- Edward Jones
- Electrical and Electronic Engineering, University of Galway, University Road, H91 TK33 Galway, Ireland
9
Borrie SA, Yoho SE, Healy EW, Barrett TS. The Application of Time-Frequency Masking To Improve Intelligibility of Dysarthric Speech in Background Noise. J Speech Lang Hear Res 2023; 66:1853-1866. [PMID: 36944186] [PMCID: PMC10457087] [DOI: 10.1044/2023_jslhr-22-00558]
Abstract
PURPOSE Background noise reduces speech intelligibility. Time-frequency (T-F) masking is an established signal processing technique that improves intelligibility of neurotypical speech in background noise. Here, we investigated a novel application of T-F masking, assessing its potential to improve intelligibility of neurologically degraded speech in background noise. METHOD Listener participants (N = 422) completed an intelligibility task either in the laboratory or online, listening to and transcribing audio recordings of neurotypical (control) and neurologically degraded (dysarthria) speech under three different processing types: speech in quiet (quiet), speech mixed with cafeteria noise (noise), and speech mixed with cafeteria noise and then subsequently processed by an ideal quantized mask (IQM) to remove the noise. RESULTS We observed significant reductions in intelligibility of dysarthric speech, even at highly favorable signal-to-noise ratios (+11 to +23 dB) that did not impact neurotypical speech. We also observed significant intelligibility improvements from speech in noise to IQM-processed speech for both control and dysarthric speech across a wide range of noise levels. Furthermore, the overall benefit of IQM processing for dysarthric speech was comparable with that of the control speech in background noise, as was the intelligibility data collected in the laboratory versus online. CONCLUSIONS This study demonstrates proof of concept, validating the application of T-F masks to a neurologically degraded speech signal. Given that intelligibility challenges greatly impact communication, and thus the lives of people with dysarthria and their communication partners, the development of clinical tools to enhance intelligibility in this clinical population is critical.
Affiliation(s)
- Stephanie A. Borrie
- Department of Communicative Disorders and Deaf Education, Utah State University, Logan
- Sarah E. Yoho
- Department of Communicative Disorders and Deaf Education, Utah State University, Logan
- Department of Speech and Hearing Science, The Ohio State University, Columbus
- Eric W. Healy
- Department of Speech and Hearing Science, The Ohio State University, Columbus
10
Healy EW, Johnson EM, Pandey A, Wang D. Progress made in the efficacy and viability of deep-learning-based noise reduction. J Acoust Soc Am 2023; 153:2751. [PMID: 37133814] [PMCID: PMC10159658] [DOI: 10.1121/10.0019341]
Abstract
Recent years have brought considerable advances to our ability to increase intelligibility through deep-learning-based noise reduction, especially for hearing-impaired (HI) listeners. In this study, intelligibility improvements resulting from a current algorithm are assessed. These benefits are compared to those resulting from the initial demonstration of deep-learning-based noise reduction for HI listeners ten years ago in Healy, Yoho, Wang, and Wang [(2013). J. Acoust. Soc. Am. 134, 3029-3038]. The stimuli and procedures were broadly similar across studies. However, whereas the initial study involved highly matched training and test conditions, as well as non-causal operation, preventing its ability to operate in the real world, the current attentive recurrent network employed different noise types, talkers, and speech corpora for training versus test, as required for generalization, and it was fully causal, as required for real-time operation. Significant intelligibility benefit was observed in every condition, which averaged 51% points across conditions for HI listeners. Further, benefit was comparable to that obtained in the initial demonstration, despite the considerable additional demands placed on the current algorithm. The retention of large benefit despite the systematic removal of various constraints as required for real-world operation reflects the substantial advances made to deep-learning-based noise reduction.
Affiliation(s)
- Eric W Healy
- Department of Speech and Hearing Science, and Center for Cognitive and Brain Sciences, The Ohio State University, Columbus, Ohio 43210, USA
- Eric M Johnson
- Department of Speech and Hearing Science, and Center for Cognitive and Brain Sciences, The Ohio State University, Columbus, Ohio 43210, USA
- Ashutosh Pandey
- Department of Computer Science and Engineering, and Center for Cognitive and Brain Sciences, The Ohio State University, Columbus, Ohio 43210, USA
- DeLiang Wang
- Department of Computer Science and Engineering, and Center for Cognitive and Brain Sciences, The Ohio State University, Columbus, Ohio 43210, USA
11
Venkata Lakshmi S, Sujatha K, Janet J. A hybrid discriminant fuzzy DNN with enhanced modularity bat algorithm for speech recognition. J Intell Fuzzy Syst 2022. [DOI: 10.3233/jifs-212945]
Abstract
In recent years, speech processing has become a major application area within signal processing. Because parts of the speech signal can become inaudible, people with hearing impairment have difficulty understanding speech, which gives speech recognition a crucial role. Developing Automatic Speech Recognition (ASR) remains a major research challenge with respect to noise, domain, vocabulary size, and language and speaker variability. Designing a speech recognition system requires careful attention to issues such as performance and database evaluation, feature extraction methods, speech representations, and speech classes. In this paper, an HDF-DNN model is proposed that hybridizes a discriminant fuzzy function with a deep neural network for speech recognition. The speech signals are first pre-processed to eliminate unwanted noise, and features are extracted using Mel Frequency Cepstral Coefficients (MFCC). The hybrid deep neural network and discriminant fuzzy logic are used to assist hearing-impaired listeners with enhanced speech intelligibility. Because both the DNN and the discriminant fuzzy component have parameters that are difficult to set, an Enhanced Modularity function-based Bat Algorithm (EMBA) is used as the optimization tool. The experimental results show that the proposed hybrid deep learning model for automatic speech recognition identifies speech more effectively than the MFCC-CNN, CSVM, and deep autoencoder techniques, improving overall accuracy by 8.31%, 9.71%, and 10.25%, respectively.
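The MFCC front end mentioned above is standard; a minimal extraction example using librosa (a library and parameter settings chosen here for illustration, not named in the paper):

```python
import librosa

# Load a mono recording and compute 13 Mel-frequency cepstral coefficients per
# frame; the file name, coefficient count, and frame settings are illustrative.
y, sr = librosa.load("utterance.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=512, hop_length=160)
print(mfcc.shape)  # (13, number_of_frames)
```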
Affiliation(s)
- S. Venkata Lakshmi
- Department of Artificial Intelligence and Data Science, Sri Krishna College of Engineering and Technology, Coimbatore, Tamil Nadu, India
- K. Sujatha
- Department of Computer Science, Wenzhou-Kean University, Zhejiang Province, China
- J. Janet
- Department of CSE, Sri Krishna College of Engineering and Technology, Coimbatore, India
12
Chou KF, Boyd AD, Best V, Colburn HS, Sen K. A biologically oriented algorithm for spatial sound segregation. Front Neurosci 2022; 16:1004071. [PMID: 36312015] [PMCID: PMC9614053] [DOI: 10.3389/fnins.2022.1004071]
Abstract
Listening in an acoustically cluttered scene remains a difficult task for both machines and hearing-impaired listeners. Normal-hearing listeners accomplish this task with relative ease by segregating the scene into its constituent sound sources, then selecting and attending to a target source. An assistive listening device that mimics the biological mechanisms underlying this behavior may provide an effective solution for those with difficulty listening in acoustically cluttered environments (e.g., a cocktail party). Here, we present a binaural sound segregation algorithm based on a hierarchical network model of the auditory system. In the algorithm, binaural sound inputs first drive populations of neurons tuned to specific spatial locations and frequencies. The spiking responses of neurons in the output layer are then reconstructed into audible waveforms via a novel reconstruction method. We evaluate the performance of the algorithm with a speech-on-speech intelligibility task in normal-hearing listeners. This two-microphone-input algorithm is shown to provide listeners with perceptual benefit similar to that of a 16-microphone acoustic beamformer. These results demonstrate the promise of this biologically inspired algorithm for enhancing selective listening in challenging multi-talker scenes.
Affiliation(s)
- Kenny F. Chou
- Department of Biomedical Engineering, Boston University, Boston, MA, United States
- Alexander D. Boyd
- Department of Biomedical Engineering, Boston University, Boston, MA, United States
- Virginia Best
- Department of Speech, Language and Hearing Sciences, Boston University, Boston, MA, United States
- H. Steven Colburn
- Department of Biomedical Engineering, Boston University, Boston, MA, United States
- Kamal Sen
- Department of Biomedical Engineering, Boston University, Boston, MA, United States
- Correspondence: Kamal Sen
13
Carter BL, Apoux F, Healy EW. The Influence of Noise Type and Semantic Predictability on Word Recall in Older Listeners and Listeners With Hearing Impairment. J Speech Lang Hear Res 2022; 65:3548-3565. [PMID: 35973100] [PMCID: PMC9913215] [DOI: 10.1044/2022_jslhr-22-00075]
Abstract
PURPOSE A dual-task paradigm was implemented to investigate how noise type and sentence context may interact with age and hearing loss to impact word recall during speech recognition. METHOD Three noise types with varying degrees of temporal/spectrotemporal modulation were used: speech-shaped noise, speech-modulated noise, and three-talker babble. Participant groups included younger listeners with normal hearing (NH), older listeners with near-normal hearing, and older listeners with sensorineural hearing loss. An adaptive measure was used to establish the signal-to-noise ratio approximating 70% sentence recognition for each participant in each noise type. A word-recall task was then implemented while matching speech-recognition performance across noise types and participant groups. Random-intercept linear mixed-effects models were used to determine the effects of and interactions between noise type, sentence context, and participant group on word recall. RESULTS The results suggest that noise type does not significantly impact word recall when word-recognition performance is controlled. When data from noise types were pooled and compared with quiet, and recall was assessed: older listeners with near-normal hearing performed well when either quiet backgrounds or high sentence context (or both) were present, but older listeners with hearing loss performed well only when both quiet backgrounds and high sentence context were present. Younger listeners with NH were robust to the detrimental effects of noise and low context. CONCLUSIONS The general presence of noise has the potential to decrease word recall, but type of noise does not appear to significantly impact this observation when overall task difficulty is controlled. The presence of noise as well as deficits related to age and/or hearing loss appear to limit the availability of cognitive processing resources available for working memory during conversation in difficult listening environments. The conversation environments that impact these resources appear to differ depending on age and/or hearing status.
Affiliation(s)
- Brittney L. Carter
- Department of Speech and Hearing Science, The Ohio State University, Columbus
- Frédéric Apoux
- Department of Otolaryngology—Head & Neck Surgery, The Ohio State University, Columbus
- Eric W. Healy
- Department of Speech and Hearing Science, The Ohio State University, Columbus
14
Speech recognition using Taylor-gradient Descent political optimization based Deep residual network. Comput Speech Lang 2022. [DOI: 10.1016/j.csl.2022.101442]
15
Gutz SE, Rowe HP, Tilton-Bolowsky VE, Green JR. Speaking with a KN95 face mask: a within-subjects study on speaker adaptation and strategies to improve intelligibility. Cogn Res Princ Implic 2022; 7:73. [PMID: 35907167] [PMCID: PMC9339031] [DOI: 10.1186/s41235-022-00423-4]
Abstract
Mask-wearing during the COVID-19 pandemic has prompted a growing interest in the functional impact of masks on speech and communication. Prior work has shown that masks dampen sound, impede visual communication cues, and reduce intelligibility. However, more work is needed to understand how speakers change their speech while wearing a mask and to identify strategies to overcome the impact of wearing a mask. Data were collected from 19 healthy adults during a single in-person session. We investigated the effects of wearing a KN95 mask on speech intelligibility, as judged by two speech-language pathologists, examined speech kinematics and acoustics associated with mask-wearing, and explored KN95 acoustic filtering. We then considered the efficacy of three speaking strategies to improve speech intelligibility: Loud, Clear, and Slow speech. To inform speaker strategy recommendations, we related findings to self-reported speaker effort. Results indicated that healthy speakers could compensate for the presence of a mask and achieve normal speech intelligibility. Additionally, we showed that speaking loudly or clearly, and to a lesser extent slowly, improved speech intelligibility. However, using these strategies may require increased physical and cognitive effort and should be used only when necessary. These results can inform recommendations for speakers wearing masks, particularly those with communication disorders (e.g., dysarthria) who may struggle to adapt to a mask but can respond to explicit instructions. Such recommendations may further help non-native speakers and those communicating in a noisy environment or with listeners with hearing loss.
Affiliation(s)
- Sarah E. Gutz
- Program in Speech and Hearing Bioscience and Technology, Harvard Medical School, Boston, MA, USA
- Hannah P. Rowe
- Department of Communication Sciences and Disorders, MGH Institute of Health Professions, Building 79/96, 2nd floor, 13th Street, Boston, MA 02129, USA
- Victoria E. Tilton-Bolowsky
- Department of Communication Sciences and Disorders, MGH Institute of Health Professions, Building 79/96, 2nd floor, 13th Street, Boston, MA 02129, USA
- Jordan R. Green
- Program in Speech and Hearing Bioscience and Technology, Harvard Medical School, Boston, MA, USA
- Department of Communication Sciences and Disorders, MGH Institute of Health Professions, Building 79/96, 2nd floor, 13th Street, Boston, MA 02129, USA
16
Recent Trends in AI-Based Intelligent Sensing. Electronics 2022. [DOI: 10.3390/electronics11101661]
Abstract
In recent years, intelligent sensing has gained significant attention because of its autonomous decision-making ability to solve complex problems. Today, smart sensors complement and enhance the capabilities of human beings and have been widely embraced in numerous application areas. Artificial intelligence (AI) has made astounding growth in domains of natural language processing, machine learning (ML), and computer vision. The methods based on AI enable a computer to learn and monitor activities by sensing the source of information in a real-time environment. The combination of these two technologies provides a promising solution in intelligent sensing. This survey provides a comprehensive summary of recent research on AI-based algorithms for intelligent sensing. This work also presents a comparative analysis of algorithms, models, influential parameters, available datasets, applications and projects in the area of intelligent sensing. Furthermore, we present a taxonomy of AI models along with the cutting edge approaches. Finally, we highlight challenges and open issues, followed by the future research directions pertaining to this exciting and fast-moving field.
17
Abstract
Hearing aids continue to acquire increasingly sophisticated sound-processing features beyond basic amplification. On the one hand, these have the potential to add user benefit and allow for personalization. On the other hand, if such features are to benefit according to their potential, they require clinicians to be acquainted with both the underlying technologies and the specific fitting handles made available by the individual hearing aid manufacturers. Ensuring benefit from hearing aids in typical daily listening environments requires that the hearing aids handle sounds that interfere with communication, generically referred to as “noise.” With this aim, considerable efforts from both academia and industry have led to increasingly advanced algorithms that handle noise, typically using the principles of directional processing and postfiltering. This article provides an overview of the techniques used for noise reduction in modern hearing aids. First, classical techniques are covered as they are used in modern hearing aids. The discussion then shifts to how deep learning, a subfield of artificial intelligence, provides a radically different way of solving the noise problem. Finally, the results of several experiments are used to showcase the benefits of recent algorithmic advances in terms of signal-to-noise ratio, speech intelligibility, selective attention, and listening effort.
18
Healy EW, Johnson EM, Delfarah M, Krishnagiri DS, Sevich VA, Taherian H, Wang D. Deep learning based speaker separation and dereverberation can generalize across different languages to improve intelligibility. J Acoust Soc Am 2021; 150:2526. [PMID: 34717521] [PMCID: PMC8637753] [DOI: 10.1121/10.0006565]
Abstract
The practical efficacy of deep learning based speaker separation and/or dereverberation hinges on its ability to generalize to conditions not employed during neural network training. The current study was designed to assess the ability to generalize across extremely different training versus test environments. Training and testing were performed using different languages having no known common ancestry and correspondingly large linguistic differences: English for training and Mandarin for testing. Additional generalizations included untrained speech corpus/recording channel, target-to-interferer energy ratios, reverberation room impulse responses, and test talkers. A deep computational auditory scene analysis algorithm, employing complex time-frequency masking to estimate both magnitude and phase, was used to segregate two concurrent talkers and simultaneously remove large amounts of room reverberation to increase the intelligibility of a target talker. Significant intelligibility improvements were observed for the normal-hearing listeners in every condition. Benefit averaged 43.5 percentage points across conditions and was comparable to that obtained when training and testing were performed both in English. Benefit is projected to be considerably larger for individuals with hearing impairment. It is concluded that a properly designed and trained deep speaker separation/dereverberation network can be capable of generalization across vastly different acoustic environments that include different languages.
Affiliation(s)
- Eric W Healy
- Department of Speech and Hearing Science, The Ohio State University, Columbus, Ohio 43210, USA
- Eric M Johnson
- Department of Speech and Hearing Science, The Ohio State University, Columbus, Ohio 43210, USA
- Masood Delfarah
- Department of Computer Science and Engineering, The Ohio State University, Columbus, Ohio 43210, USA
- Divya S Krishnagiri
- Department of Speech and Hearing Science, The Ohio State University, Columbus, Ohio 43210, USA
- Victoria A Sevich
- Department of Speech and Hearing Science, The Ohio State University, Columbus, Ohio 43210, USA
- Hassan Taherian
- Department of Computer Science and Engineering, The Ohio State University, Columbus, Ohio 43210, USA
- DeLiang Wang
- Department of Computer Science and Engineering, The Ohio State University, Columbus, Ohio 43210, USA
19
Tang Y. Glimpse-based estimation of speech intelligibility from speech-in-noise using artificial neural networks. Comput Speech Lang 2021. [DOI: 10.1016/j.csl.2021.101220]
20
Geravanchizadeh M, Zakeri S. Ear-EEG-based binaural speech enhancement (ee-BSE) using auditory attention detection and audiometric characteristics of hearing-impaired subjects. J Neural Eng 2021; 18. [PMID: 34289464] [DOI: 10.1088/1741-2552/ac16b4]
Abstract
Objective. Speech perception in cocktail party scenarios has been the concern of a group of researchers who are involved with the design of hearing-aid devices. Approach. In this paper, a new unified ear-EEG-based binaural speech enhancement system is introduced for hearing-impaired (HI) listeners. The proposed model, which is based on auditory attention detection (AAD) and individual hearing threshold (HT) characteristics, has four main processing stages. In the binaural processing stage, a system based on the deep neural network is trained to estimate auditory ratio masks for each of the speakers in the mixture signal. In the EEG processing stage, AAD is employed to select one ratio mask corresponding to the attended speech. Here, the same EEG data is also used to predict the HTs of listeners who participated in the EEG recordings. The third stage, called insertion gain computation, concerns the calculation of a special amplification gain based on individual HTs. Finally, in the selection-resynthesis-amplification stage, the attended speech signals of the target are resynthesized based on the selected auditory mask and then are amplified using the computed insertion gain. Main results. The detection of the attended speech and the HTs are achieved by classifiers that are trained with features extracted from the scalp EEG or the ear EEG signals. The results of evaluating AAD and HT detection show high detection accuracies. The systematic evaluations of the proposed system yield substantial intelligibility and quality improvements for the HI and normal-hearing audiograms. Significance. The AAD method determines the direction of attention from single-trial EEG signals without access to audio signals of the speakers. The amplification procedure could be adjusted for each subject based on the individual HTs. The present model has the potential to be considered as an important processing tool to personalize the neuro-steered hearing aids.
Affiliation(s)
- Masoud Geravanchizadeh
- Faculty of Electrical and Computer Engineering, University of Tabriz, Tabriz 51666-15813, Iran
- Sahar Zakeri
- Faculty of Electrical and Computer Engineering, University of Tabriz, Tabriz 51666-15813, Iran
21
Behavioral Pattern Analysis between Bilingual and Monolingual Listeners’ Natural Speech Perception on Foreign-Accented English Language Using Different Machine Learning Approaches. Technologies 2021. [DOI: 10.3390/technologies9030051]
Abstract
Speech perception in an adverse or noisy background is a complex and challenging human process, made even more complicated by foreign-accented language for both bilingual and monolingual individuals. Listeners who have difficulty hearing are affected most by such situations. Despite considerable effort, improving speech intelligibility in noise remains elusive. This study therefore investigates the behavioral patterns of Bengali–English bilinguals and native American English monolinguals listening to foreign-accented English under bubble noise, Gaussian (white) noise, and quiet conditions. Twelve normal-hearing participants (six Bengali–English bilinguals and six native American English monolinguals) took part. Statistical analysis shows that speech presented in different noise types has a significant effect (p = 0.009) on listening for both bilingual and monolingual listeners across different sound levels (55 dB, 65 dB, and 75 dB). Six machine learning approaches (Logistic Regression (LR), Linear Discriminant Analysis (LDA), K-nearest neighbors (KNN), Naïve Bayes (NB), Classification and Regression Trees (CART), and Support Vector Machine (SVM)) were tested and evaluated to differentiate bilingual from monolingual individuals based on their behavioral patterns in both noisy and quiet environments. The best performance was obtained with LDA, which successfully differentiated bilingual from monolingual listeners 60% of the time. A deep neural network-based model is proposed to improve this measure further and achieved nearly 100% accuracy in differentiating bilingual from monolingual individuals.
22
Healy EW, Tan K, Johnson EM, Wang D. An effectively causal deep learning algorithm to increase intelligibility in untrained noises for hearing-impaired listeners. J Acoust Soc Am 2021; 149:3943. [PMID: 34241481] [PMCID: PMC8186949] [DOI: 10.1121/10.0005089]
Abstract
Real-time operation is critical for noise reduction in hearing technology. The essential requirement of real-time operation is causality: an algorithm does not use future time-frame information and, instead, completes its operation by the end of the current time frame. This requirement is extended currently through the concept of "effectively causal," in which future time-frame information within the brief delay tolerance of the human speech-perception mechanism is used. Effectively causal deep learning was used to separate speech from background noise and improve intelligibility for hearing-impaired listeners. A single-microphone, gated convolutional recurrent network was used to perform complex spectral mapping. By estimating both the real and imaginary parts of the noise-free speech, both the magnitude and phase of the estimated noise-free speech were obtained. The deep neural network was trained using a large set of noises and tested using complex noises not employed during training. Significant algorithm benefit was observed in every condition, which was largest for those with the greatest hearing loss. Allowable delays across different communication settings are reviewed and assessed. The current work demonstrates that effectively causal deep learning can significantly improve intelligibility for one of the largest populations of need in challenging conditions involving untrained background noises.
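Complex spectral mapping means the network predicts the real and imaginary parts of the clean short-time spectrum, from which magnitude and phase follow directly; a minimal sketch with the network itself omitted and illustrative array names:

```python
import numpy as np

def resynthesize_frame(est_real, est_imag):
    """Given network estimates of the real and imaginary parts of the clean
    STFT, recover the complex spectrum, magnitude, and phase for resynthesis."""
    est_stft = est_real + 1j * est_imag
    magnitude = np.abs(est_stft)
    phase = np.angle(est_stft)
    return est_stft, magnitude, phase
```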
Affiliation(s)
- Eric W Healy
- Department of Speech and Hearing Science, The Ohio State University, Columbus, Ohio 43210, USA
- Ke Tan
- Department of Computer Science and Engineering, The Ohio State University, Columbus, Ohio 43210, USA
- Eric M Johnson
- Department of Speech and Hearing Science, The Ohio State University, Columbus, Ohio 43210, USA
- DeLiang Wang
- Department of Computer Science and Engineering, The Ohio State University, Columbus, Ohio 43210, USA
23
Defending Against Microphone-Based Attacks with Personalized Noise. Proc Priv Enhancing Technol 2021. [DOI: 10.2478/popets-2021-0021]
Abstract
Voice-activated commands have become a key feature of popular devices such as smartphones, home assistants, and wearables. For convenience, many people configure their devices to be ‘always on’ and listening for voice commands from the user using a trigger phrase such as “Hey Siri,” “Okay Google,” or “Alexa.” However, false positives for these triggers often result in privacy violations with conversations being inadvertently uploaded to the cloud. In addition, malware that can record one’s conversations remains a significant threat to privacy. Unlike with cameras, which people can physically obscure and be assured of their privacy, people do not have a way of knowing whether their microphone is indeed off and are left with no tangible defenses against voice-based attacks. We envision a general-purpose physical defense that uses a speaker to inject specialized obfuscating ‘babble noise’ into the microphones of devices to protect against automated and human-based attacks. We present a comprehensive study of how specially crafted, personalized ‘babble’ noise (‘MyBabble’) can be effective at moderate signal-to-noise ratios and can provide a viable defense against microphone-based eavesdropping attacks.
24
Improving Speech Quality for Hearing Aid Applications Based on Wiener Filter and Composite of Deep Denoising Autoencoders. Signals 2020. [DOI: 10.3390/signals1020008]
Abstract
In hearing aid devices, speech enhancement techniques are a critical component to enable users with hearing loss to attain improved speech quality under noisy conditions. Recently, the deep denoising autoencoder (DDAE) was adopted successfully for recovering the desired speech from noisy observations. However, a single DDAE cannot extract contextual information sufficiently due to the poor generalization in an unknown signal-to-noise ratio (SNR), the local minima, and the fact that the enhanced output shows some residual noise and some level of discontinuity. In this paper, we propose a hybrid approach for hearing aid applications based on two stages: (1) the Wiener filter, which attenuates the noise component and generates a clean speech signal; (2) a composite of three DDAEs with different window lengths, each of which is specialized for a specific enhancement task. Two typical high-frequency hearing loss audiograms were used to test the performance of the approach: Audiogram 1 = (0, 0, 0, 60, 80, 90) and Audiogram 2 = (0, 15, 30, 60, 80, 85). The hearing-aid speech perception index, the hearing-aid speech quality index, and the perceptual evaluation of speech quality were used to evaluate the performance. The experimental results show that the proposed method achieved significantly better results compared with the Wiener filter or a single deep denoising autoencoder alone.
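The Wiener stage referred to above applies, per frequency bin, the classical gain determined by the a priori SNR; in the usual notation (not taken from this paper),

\[
G(k,\ell) = \frac{\xi(k,\ell)}{1 + \xi(k,\ell)}, \qquad \hat{S}(k,\ell) = G(k,\ell)\,Y(k,\ell),
\]

where \(\xi(k,\ell)\) is the a priori SNR and \(Y(k,\ell)\) is the noisy spectrum in bin \(k\) of frame \(\ell\); the DDAE composite then operates on the Wiener-filtered output.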
25
Wearable Hearing Device Spectral Enhancement Driven by Non-Negative Sparse Coding-Based Residual Noise Reduction. Sensors (Basel) 2020; 20:5751. [PMID: 33050447] [PMCID: PMC7600179] [DOI: 10.3390/s20205751]
Abstract
This paper proposes a novel technique to improve a spectral statistical filter for speech enhancement, to be applied in wearable hearing devices such as hearing aids. The proposed method is implemented considering a 32-channel uniform polyphase discrete Fourier transform filter bank, for which the overall algorithm processing delay is 8 ms in accordance with the hearing device requirements. The proposed speech enhancement technique, which exploits the concepts of both non-negative sparse coding (NNSC) and spectral statistical filtering, provides an online unified framework to overcome the problem of residual noise in spectral statistical filters under noisy environments. First, the spectral gain attenuator of the statistical Wiener filter is obtained using the a priori signal-to-noise ratio (SNR) estimated through a decision-directed approach. Next, the spectrum estimated using the Wiener spectral gain attenuator is decomposed by applying the NNSC technique to the target speech and residual noise components. These components are used to develop an NNSC-based Wiener spectral gain attenuator to achieve enhanced speech. The performance of the proposed NNSC-Wiener filter was evaluated through a perceptual evaluation of the speech quality scores under various noise conditions with SNRs ranging from -5 to 20 dB. The results indicated that the proposed NNSC-Wiener filter can outperform the conventional Wiener filter and NNSC-based speech enhancement methods at all SNRs.
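The decision-directed estimate mentioned above is the standard recursion in which the a priori SNR blends the previous frame's enhanced-speech estimate with the current instantaneous SNR; a minimal per-frame sketch with illustrative variable names and the commonly used smoothing constant of 0.98:

```python
import numpy as np

def decision_directed_xi(prev_enhanced_mag, noise_psd, noisy_mag, alpha=0.98):
    """A priori SNR via the decision-directed rule (standard formulation):
    xi = alpha * |S_hat(l-1)|^2 / noise_psd + (1 - alpha) * max(gamma - 1, 0),
    where gamma is the a posteriori SNR of the current frame."""
    gamma = (noisy_mag ** 2) / (noise_psd + 1e-12)          # a posteriori SNR
    xi = alpha * (prev_enhanced_mag ** 2) / (noise_psd + 1e-12) \
         + (1.0 - alpha) * np.maximum(gamma - 1.0, 0.0)
    return xi

# The Wiener spectral gain for the frame then follows as xi / (1 + xi).
```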
26
Rajesh Kumar T, Suresh GR, Kanaga Subaraja S, Karthikeyan C. Taylor-AMS features and deep convolutional neural network for converting nonaudible murmur to normal speech. Comput Intell 2020. [DOI: 10.1111/coin.12281]
Affiliation(s)
- T. Rajesh Kumar
- Department of Computer Science and Engineering, Koneru Lakshmaiah Education Foundation, Guntur, India
- G. R. Suresh
- Department of Biomedical Engineering, St. Peter's Institute of Higher Education and Research, Chennai, India
- S. Kanaga Subaraja
- Department of Computer Science and Engineering, Easwari Engineering College, Chennai, India
- C. Karthikeyan
- Department of Computer Science and Engineering, Koneru Lakshmaiah Education Foundation, Guntur, India
27
Auditory Device Voice Activity Detection Based on Statistical Likelihood-Ratio Order Statistics. Appl Sci (Basel) 2020. [DOI: 10.3390/app10155026]
Abstract
This paper proposes a technique for improving statistical-model-based voice activity detection (VAD) in noisy environments to be applied in an auditory hearing aid. The proposed method is implemented for a uniform polyphase discrete Fourier transform filter bank satisfying an auditory device time latency of 8 ms. The proposed VAD technique provides an online unified framework to overcome the frequent false rejection of the statistical-model-based likelihood-ratio test (LRT) in noisy environments. The method is based on the observation that the sparseness of speech and background noise cause high false-rejection error rates in statistical LRT-based VAD—the false rejection rate increases as the sparseness increases. We demonstrate that the false-rejection error rate can be reduced by incorporating likelihood-ratio order statistics into a conventional LRT VAD. We confirm experimentally that the proposed method relatively reduces the average detection error rate by 15.8% compared to a conventional VAD with only minimal change in the false acceptance probability for three different noise conditions whose signal-to-noise ratio ranges from 0 to 20 dB.
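For context, the likelihood-ratio test underlying this family of VADs (in the common Gaussian formulation of Sohn and colleagues, not restated in this abstract) scores each frequency bin as

\[
\Lambda_k = \frac{1}{1+\xi_k}\exp\!\left(\frac{\gamma_k\,\xi_k}{1+\xi_k}\right),
\qquad
\frac{1}{K}\sum_{k=1}^{K}\log \Lambda_k \;\gtrless\; \eta,
\]

where \(\xi_k\) and \(\gamma_k\) are the a priori and a posteriori SNRs in bin \(k\) and \(\eta\) is the speech/non-speech decision threshold; the cited work replaces the plain average with order statistics of the per-bin log-likelihood ratios to reduce false rejections.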
Collapse
|
28
|
Mourão GL, Costa MH, Paul S. Speech Intelligibility for Cochlear Implant Users with the MMSE Noise-Reduction Time-Frequency Mask. Biomed Signal Process Control 2020. [DOI: 10.1016/j.bspc.2020.101982] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/30/2022]
|
29
|
Healy EW, Johnson EM, Delfarah M, Wang D. A talker-independent deep learning algorithm to increase intelligibility for hearing-impaired listeners in reverberant competing talker conditions. THE JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA 2020; 147:4106. [PMID: 32611178 PMCID: PMC7314568 DOI: 10.1121/10.0001441] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 12/16/2019] [Revised: 05/28/2020] [Accepted: 05/29/2020] [Indexed: 05/20/2023]
Abstract
Deep learning based speech separation or noise reduction needs to generalize to voices not encountered during training and to operate under multiple corruptions. The current study provides such a demonstration for hearing-impaired (HI) listeners. Sentence intelligibility was assessed under conditions of a single interfering talker and substantial amounts of room reverberation. A talker-independent deep computational auditory scene analysis (CASA) algorithm was employed, in which talkers were separated and dereverberated in each time frame (simultaneous grouping stage), then the separated frames were organized to form two streams (sequential grouping stage). The deep neural networks consisted of specialized convolutional neural networks, one based on U-Net and the other a temporal convolutional network. It was found that every HI (and normal-hearing, NH) listener received algorithm benefit in every condition. Benefit averaged across all conditions ranged from 52 to 76 percentage points for individual HI listeners and averaged 65 points. Further, processed HI intelligibility significantly exceeded unprocessed NH intelligibility. Although the current utterance-based model was not implemented as a real-time system, a perspective on this important issue is provided. It is concluded that deep CASA represents a powerful framework capable of producing large increases in HI intelligibility for potentially any two voices.
Collapse
Affiliation(s)
- Eric W Healy
- Department of Speech and Hearing Science, and Center for Cognitive and Brain Sciences, The Ohio State University, Columbus, Ohio 43210, USA
| | - Eric M Johnson
- Department of Speech and Hearing Science, and Center for Cognitive and Brain Sciences, The Ohio State University, Columbus, Ohio 43210, USA
| | - Masood Delfarah
- Department of Computer Science and Engineering, The Ohio State University, Columbus, Ohio 43210, USA
| | - DeLiang Wang
- Department of Computer Science and Engineering, The Ohio State University, Columbus, Ohio 43210, USA
| |
Collapse
|
30
|
Liu F, Demosthenous A, Yasin I. Auditory filter-bank compression improves estimation of signal-to-noise ratio for speech in noise. THE JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA 2020; 147:3197. [PMID: 32486788 DOI: 10.1121/10.0001168] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 07/13/2019] [Accepted: 04/09/2020] [Indexed: 06/11/2023]
Abstract
Signal-to-noise ratio (SNR) estimation is necessary for many speech processing applications and is often challenged by nonstationary noise. The authors have previously demonstrated that the variance of spectral entropy (VSE) is a reliable estimate of SNR in nonstationary noise. Based on pre-estimated VSE-SNR relationship functions, the SNR of unseen acoustic environments can be estimated from the measured VSE. This study predicts that introducing a compressive function based on cochlear processing will increase the stability of the pre-estimated VSE-SNR relationship functions, and demonstrates that calculating the VSE based on a nonlinear filter-bank, simulating cochlear compression, reduces the VSE-based SNR estimation errors. VSE-SNR relationship functions were estimated using speech tokens presented in babble noise comprising different numbers of speakers. Results showed that the coefficient of determination (R2) of the estimated VSE-SNR relationship functions improved by more than 26% in absolute percentage terms when using a filter-bank with a compressive function, compared to a linear filter-bank without compression. In 2-talker babble noise, the estimation accuracy was more than 3 dB better than that of other published methods.
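For illustration, the quantity at the center of this method, the variance of spectral entropy with a compressive nonlinearity applied to band energies, can be sketched as below. The function name, the power-law exponent standing in for cochlear compression, and the generic filter-bank energies are assumptions; the study's specific auditory filter-bank and its VSE-SNR mapping functions are not reproduced.

import numpy as np

def variance_of_spectral_entropy(band_energy, compress_exp=0.3):
    """Variance of per-frame spectral entropy, with power-law compression of band energies.

    band_energy : (frames, bands) non-negative filter-bank energies.
    compress_exp : exponent of a simple power-law compression (1.0 disables compression).
    """
    e = np.power(np.maximum(band_energy, 1e-12), compress_exp)   # compressive nonlinearity
    p = e / e.sum(axis=1, keepdims=True)                         # per-frame probability mass
    entropy = -(p * np.log(p)).sum(axis=1)                       # spectral entropy per frame
    return entropy.var()                                         # VSE over the utterance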
Collapse
Affiliation(s)
- Fangqi Liu
- Department of Electronic and Electrical Engineering, University College London, London WC1E 7JE, United Kingdom
| | - Andreas Demosthenous
- Department of Electronic and Electrical Engineering, University College London, London WC1E 7JE, United Kingdom
| | - Ifat Yasin
- Department of Computer Science, University College London, London WC1E 6BT, United Kingdom
| |
Collapse
|
31
|
Khaleelur Rahiman PF, Jayanthi VS, Jayanthi AN. RETRACTED: Speech enhancement method using deep learning approach for hearing-impaired listeners. Health Informatics J 2020; 27:1460458219893850. [PMID: 31969042 DOI: 10.1177/1460458219893850] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/17/2022]
Affiliation(s)
| | - V S Jayanthi
- Rajagiri School of Engineering and Technology, India
| | | |
Collapse
|
32
|
Gutiérrez-Muñoz M, González-Salazar A, Coto-Jiménez M. Evaluation of Mixed Deep Neural Networks for Reverberant Speech Enhancement. Biomimetics (Basel) 2019; 5:biomimetics5010001. [PMID: 31861828 PMCID: PMC7148527 DOI: 10.3390/biomimetics5010001] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/30/2019] [Revised: 12/06/2019] [Accepted: 12/16/2019] [Indexed: 11/16/2022] Open
Abstract
Speech signals are degraded in real-life environments by background noise and other factors. Processing such signals for voice recognition and voice analysis systems presents important challenges. One condition that is particularly difficult for those systems to handle is reverberation, produced by sound-wave reflections that travel from the source to the microphone along multiple paths. To enhance signals in such adverse conditions, several deep learning-based methods have been proposed and proven effective. Recently, recurrent neural networks, especially those with long short-term memory (LSTM), have shown strong results in tasks involving time-dependent processing of signals, such as speech. One of the most challenging aspects of LSTM networks is the high computational cost of training, which has limited extended experimentation in several cases. In this work, we evaluate hybrid neural network models that learn different reverberation conditions without any prior information. The results show that some combinations of LSTM and perceptron layers produce good results in comparison with those from pure LSTM networks, given a fixed number of layers. The evaluation was based on quality measurements of the signal's spectrum, the training time of the networks, and statistical validation of the results. In total, 120 artificial neural networks of eight different types were trained and compared. The results support the view that hybrid networks are an important option for speech signal enhancement, given that the reduction in training time is on the order of 30% for processes that can normally take several days or weeks, depending on the amount of data. These efficiency advantages come without a significant drop in quality.
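For illustration, a minimal PyTorch sketch of one hybrid configuration of the general kind compared in the study (an LSTM layer followed by perceptron layers mapping reverberant spectral frames to enhanced frames) is given below. The class name, layer sizes, and the 257-bin feature dimension are placeholder assumptions, not the authors' configurations.

import torch
import torch.nn as nn

class HybridLSTMEnhancer(nn.Module):
    """LSTM layer followed by fully connected (perceptron) layers for frame-wise enhancement."""

    def __init__(self, n_features=257, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.dense = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_features),   # one enhanced frame per input frame
        )

    def forward(self, x):                    # x: (batch, frames, n_features)
        h, _ = self.lstm(x)
        return self.dense(h)

# Example: enhance a batch of 4 utterances of 100 frames each.
model = HybridLSTMEnhancer()
enhanced = model(torch.randn(4, 100, 257))   # (4, 100, 257)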
Collapse
|
33
|
Delfarah M, Wang D. Deep Learning for Talker-dependent Reverberant Speaker Separation: An Empirical Study. IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING 2019; 27:1839-1848. [PMID: 33748321 PMCID: PMC7970708 DOI: 10.1109/taslp.2019.2934319] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/11/2023]
Abstract
Speaker separation refers to the problem of separating speech signals from a mixture of simultaneous speakers. Previous studies are limited to addressing the speaker separation problem in anechoic conditions. This paper addresses the problem of talker-dependent speaker separation in reverberant conditions, which are characteristic of real-world environments. We employ recurrent neural networks with bidirectional long short-term memory (BLSTM) to separate and dereverberate the target speech signal. We propose two-stage networks to effectively deal with both speaker separation and speech dereverberation. In the two-stage model, the first stage separates and dereverberates two-talker mixtures and the second stage further enhances the separated target signal. We have extensively evaluated the two-stage architecture, and our empirical results demonstrate large improvements over unprocessed mixtures and clear performance gain over single-stage networks in a wide range of target-to-interferer ratios and reverberation times in simulated as well as recorded rooms. Moreover, we show that time-frequency masking yields better performance than spectral mapping for reverberant speaker separation.
Collapse
Affiliation(s)
- Masood Delfarah
- Computer Science and Engineering, The Ohio State University, Columbus, OH, USA
| | - DeLiang Wang
- Computer Science and Engineering, The Ohio State University, Columbus, OH, USA
| |
Collapse
|
34
|
Goehring T, Keshavarzi M, Carlyon RP, Moore BCJ. Using recurrent neural networks to improve the perception of speech in non-stationary noise by people with cochlear implants. THE JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA 2019; 146:705. [PMID: 31370586 PMCID: PMC6773603 DOI: 10.1121/1.5119226] [Citation(s) in RCA: 44] [Impact Index Per Article: 7.3] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 04/23/2019] [Accepted: 07/08/2019] [Indexed: 05/20/2023]
Abstract
Speech-in-noise perception is a major problem for users of cochlear implants (CIs), especially with non-stationary background noise. Noise-reduction algorithms have produced benefits but relied on a priori information about the target speaker and/or background noise. A recurrent neural network (RNN) algorithm was developed for enhancing speech in non-stationary noise and its benefits were evaluated for speech perception, using both objective measures and experiments with CI simulations and CI users. The RNN was trained using speech from many talkers mixed with multi-talker or traffic noise recordings. Its performance was evaluated using speech from an unseen talker mixed with different noise recordings of the same class, either babble or traffic noise. Objective measures indicated benefits of using a recurrent over a feed-forward architecture, and predicted better speech intelligibility with than without the processing. The experimental results showed significantly improved intelligibility of speech in babble noise but not in traffic noise. CI subjects rated the processed stimuli as significantly better in terms of speech distortions, noise intrusiveness, and overall quality than unprocessed stimuli for both babble and traffic noise. These results extend previous findings for CI users to mostly unseen acoustic conditions with non-stationary noise.
Collapse
Affiliation(s)
- Tobias Goehring
- Medical Research Council Cognition and Brain Sciences Unit, University of Cambridge, 15 Chaucer Road, Cambridge CB2 7EF, United Kingdom
| | - Mahmoud Keshavarzi
- Department of Experimental Psychology, University of Cambridge, Downing Street, Cambridge CB2 3EB, United Kingdom
| | - Robert P Carlyon
- Medical Research Council Cognition and Brain Sciences Unit, University of Cambridge, 15 Chaucer Road, Cambridge CB2 7EF, United Kingdom
| | - Brian C J Moore
- Department of Experimental Psychology, University of Cambridge, Downing Street, Cambridge CB2 3EB, United Kingdom
| |
Collapse
|
35
|
Bhat GS, Shankar N, Reddy CKA, Panahi IMS. A Real-Time Convolutional Neural Network Based Speech Enhancement for Hearing Impaired Listeners Using Smartphone. IEEE ACCESS : PRACTICAL INNOVATIONS, OPEN SOLUTIONS 2019; 7:78421-78433. [PMID: 32661495 PMCID: PMC7357966 DOI: 10.1109/access.2019.2922370] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/11/2023]
Abstract
This paper presents a speech enhancement (SE) technique based on a multi-objective learning convolutional neural network to improve the overall quality of speech perceived by hearing aid (HA) users. The proposed method is implemented on a smartphone as an application that performs real-time SE; this arrangement works as an assistive tool to the HA. A multi-objective learning architecture including primary and secondary features uses a mapping-based convolutional neural network (CNN) model to remove noise from a noisy speech spectrum. The algorithm is computationally fast and has a low processing delay, which enables it to operate seamlessly on a smartphone. The steps and a detailed analysis of the real-time implementation are discussed. The proposed method is compared with existing conventional and neural-network-based SE techniques through speech quality and intelligibility metrics in various noisy speech conditions. The key contribution of this paper is the realization of a CNN SE model on a smartphone processor that works seamlessly with an HA. The experimental results demonstrate significant improvements over state-of-the-art techniques and reflect the usability of the developed SE application in noisy environments.
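For illustration, a minimal PyTorch sketch of a mapping-based CNN of the general kind described (a context window of noisy spectral frames in, one enhanced frame out) is shown below. The class name, layer sizes, context length, and single-objective output are placeholders and do not reproduce the paper's multi-objective architecture or its smartphone implementation.

import torch
import torch.nn as nn

class MappingCNN(nn.Module):
    """CNN that maps a context window of noisy log-magnitude frames to one enhanced frame."""

    def __init__(self, n_bins=257, context=5):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.out = nn.Linear(32 * context * n_bins, n_bins)   # regress the clean frame

    def forward(self, x):               # x: (batch, 1, context, n_bins)
        h = self.conv(x)
        return self.out(h.flatten(1))

# Example: a batch of 8 context windows of 5 frames x 257 frequency bins.
model = MappingCNN()
enhanced_frames = model(torch.randn(8, 1, 5, 257))   # (8, 257)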
Collapse
Affiliation(s)
- Gautam S Bhat
- Department of Electrical and Computer Engineering, The University of Texas at Dallas, Richardson TX-75080, USA
| | - Nikhil Shankar
- Department of Electrical and Computer Engineering, The University of Texas at Dallas, Richardson TX-75080, USA
| | | | - Issa M S Panahi
- Department of Electrical and Computer Engineering, The University of Texas at Dallas, Richardson TX-75080, USA
| |
Collapse
|
36
|
Healy EW, Vasko JL, Wang D. The optimal threshold for removing noise from speech is similar across normal and impaired hearing-a time-frequency masking study. THE JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA 2019; 145:EL581. [PMID: 31255108 PMCID: PMC6786891 DOI: 10.1121/1.5112828] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/16/2023]
Abstract
Hearing-impaired listeners' intolerance to background noise during speech perception is well known. The current study employed speech materials free of ceiling effects to reveal the optimal trade-off between rejecting noise and retaining speech during time-frequency masking. This relative criterion value (-7 dB) was found to hold across noise types that differ in acoustic spectro-temporal complexity. It was also found that listeners with hearing impairment and those with normal hearing performed optimally at this same value, suggesting no true noise intolerance once time-frequency units containing speech are extracted.
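For illustration, the criterion studied here can be written as a simple rule on the SNR within each time-frequency unit. The sketch below builds an ideal binary mask assuming the premixed speech and noise are available separately, and interprets the -7 dB value as a threshold on local SNR expressed relative to the overall mixture SNR, which is the usual convention; the function name and array layout are assumptions.

import numpy as np

def ideal_binary_mask(speech_power, noise_power, mixture_snr_db, relative_criterion_db=-7.0):
    """Keep a time-frequency unit if its local SNR exceeds the criterion, else discard it.

    speech_power, noise_power : (frames, channels) powers of the premixed signals.
    relative_criterion_db : criterion relative to the overall mixture SNR; about -7 dB
                            was found optimal for both listener groups in this study.
    """
    local_criterion_db = mixture_snr_db + relative_criterion_db
    local_snr_db = 10.0 * np.log10(speech_power / np.maximum(noise_power, 1e-12))
    return (local_snr_db > local_criterion_db).astype(float)   # 1 = retain unit, 0 = remove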
Collapse
Affiliation(s)
- Eric W Healy
- Department of Speech and Hearing Science, and Center for Cognitive and Brain Sciences, The Ohio State University, Columbus, Ohio 43210
| | - Jordan L Vasko
- Department of Speech and Hearing Science, and Center for Cognitive and Brain Sciences, The Ohio State University, Columbus, Ohio 43210
| | - DeLiang Wang
- Department of Computer Science and Engineering, and Center for Cognitive and Brain Sciences, The Ohio State University, Columbus, Ohio 43210
| |
Collapse
|
37
|
Keshavarzi M, Goehring T, Turner RE, Moore BCJ. Comparison of effects on subjective intelligibility and quality of speech in babble for two algorithms: A deep recurrent neural network and spectral subtraction. THE JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA 2019; 145:1493. [PMID: 31067946 DOI: 10.1121/1.5094765] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 11/15/2018] [Accepted: 03/01/2019] [Indexed: 06/09/2023]
Abstract
The effects on speech intelligibility and sound quality of two noise-reduction algorithms were compared: a deep recurrent neural network (RNN) and spectral subtraction (SS). The RNN was trained using sentences spoken by a large number of talkers with a variety of accents, presented in babble. Different talkers were used for testing. Participants with mild-to-moderate hearing loss were tested. Stimuli were given frequency-dependent linear amplification to compensate for the individual hearing losses. A paired-comparison procedure was used to compare all possible combinations of three conditions. The conditions were: speech in babble with no processing (NP) or processed using the RNN or SS. In each trial, the same sentence was played twice using two different conditions. The participants indicated which one was better and by how much in terms of speech intelligibility and (in separate blocks) sound quality. Processing using the RNN was significantly preferred over NP and over SS processing for both subjective intelligibility and sound quality, although the magnitude of the preferences was small. SS processing was not significantly preferred over NP for either subjective intelligibility or sound quality. Objective computational measures of speech intelligibility predicted better intelligibility for RNN than for SS or NP.
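For illustration, the conventional spectral subtraction baseline compared here can be sketched in a few lines, assuming a noise magnitude estimate obtained elsewhere (for example, averaged over speech pauses). The function name, over-subtraction factor, and spectral floor are generic choices, not those of the study.

import numpy as np

def spectral_subtraction(noisy_mag, noise_mag, oversub=1.0, floor=0.02):
    """Subtract an estimated noise magnitude spectrum from the noisy magnitude spectrum.

    noisy_mag : (frames, bins) magnitude spectrogram of the noisy speech.
    noise_mag : (bins,) estimated noise magnitude spectrum.
    """
    clean_mag = noisy_mag - oversub * noise_mag            # subtract the noise estimate
    return np.maximum(clean_mag, floor * noisy_mag)        # spectral floor limits musical noise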
Collapse
Affiliation(s)
- Mahmoud Keshavarzi
- Department of Psychology, University of Cambridge, Cambridge, United Kingdom
| | - Tobias Goehring
- MRC Cognition and Brain Sciences Unit, University of Cambridge, Cambridge, United Kingdom
| | - Richard E Turner
- Department of Engineering, University of Cambridge, Cambridge, United Kingdom
| | - Brian C J Moore
- Department of Psychology, University of Cambridge, Cambridge, United Kingdom
| |
Collapse
|
38
|
Healy EW, Delfarah M, Johnson EM, Wang D. A deep learning algorithm to increase intelligibility for hearing-impaired listeners in the presence of a competing talker and reverberation. THE JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA 2019; 145:1378. [PMID: 31067936 PMCID: PMC6420339 DOI: 10.1121/1.5093547] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 12/06/2018] [Revised: 02/06/2019] [Accepted: 02/19/2019] [Indexed: 05/20/2023]
Abstract
For deep learning based speech segregation to have translational significance as a noise-reduction tool, it must perform in a wide variety of acoustic environments. In the current study, performance was examined when target speech was subjected to interference from a single talker and room reverberation. Conditions were compared in which an algorithm was trained to remove both reverberation and interfering speech, or only interfering speech. A recurrent neural network incorporating bidirectional long short-term memory was trained to estimate the ideal ratio mask corresponding to target speech. Substantial intelligibility improvements were found for hearing-impaired (HI) and normal-hearing (NH) listeners across a range of target-to-interferer ratios (TIRs). HI listeners performed better with reverberation removed, whereas NH listeners demonstrated no difference. Algorithm benefit averaged 56 percentage points for the HI listeners at the least-favorable TIR, allowing these listeners to perform numerically better than young NH listeners without processing. The current study highlights the difficulty associated with perceiving speech in reverberant-noisy environments, and it extends the range of environments in which deep learning based speech segregation can be effectively applied. This increasingly wide array of environments includes not only a variety of background noises and interfering speech, but also room reverberation.
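For illustration, the ideal ratio mask that the network is trained to estimate has a standard closed form; the sketch below assumes separate access to the desired (here, target) speech power and the combined interference power in each time-frequency unit. The function name and the exponent beta are conventional choices, not taken from the paper.

import numpy as np

def ideal_ratio_mask(speech_power, interference_power, beta=0.5):
    """IRM(t, f) = (S^2 / (S^2 + N^2))**beta, commonly used with beta = 0.5."""
    ratio = speech_power / np.maximum(speech_power + interference_power, 1e-12)
    return np.power(ratio, beta)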
Collapse
Affiliation(s)
- Eric W Healy
- Department of Speech and Hearing Science, and Center for Cognitive and Brain Sciences, The Ohio State University, Columbus, Ohio 43210, USA
| | - Masood Delfarah
- Department of Computer Science and Engineering, The Ohio State University, Columbus, Ohio 43210, USA
| | - Eric M Johnson
- Department of Speech and Hearing Science, and Center for Cognitive and Brain Sciences, The Ohio State University, Columbus, Ohio 43210, USA
| | - DeLiang Wang
- Department of Computer Science and Engineering, and Center for Cognitive and Brain Sciences, The Ohio State University, Columbus, Ohio 43210, USA
| |
Collapse
|
39
|
Multi-objective learning based speech enhancement method to increase speech quality and intelligibility for hearing aid device users. Biomed Signal Process Control 2019. [DOI: 10.1016/j.bspc.2018.09.010] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/19/2022]
|
40
|
Borrie SA, Barrett TS, Yoho SE. Autoscore: An open-source automated tool for scoring listener perception of speech. THE JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA 2019; 145:392. [PMID: 30710955 PMCID: PMC6347573 DOI: 10.1121/1.5087276] [Citation(s) in RCA: 39] [Impact Index Per Article: 6.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 10/12/2018] [Revised: 11/26/2018] [Accepted: 12/10/2018] [Indexed: 05/19/2023]
Abstract
Speech perception studies typically rely on trained research assistants to score orthographic listener transcripts for words correctly identified. While the accuracy of the human scoring protocol has been validated with strong intra- and inter-rater reliability, the process of hand-scoring the transcripts is time-consuming and resource-intensive. Here, an open-source computer-based tool for automated scoring of listener transcripts (Autoscore) is built and validated on three different human-scored data sets. Results show that Autoscore is not only highly accurate, achieving approximately 99% accuracy, but also extremely efficient. Thus, Autoscore affords a practical research tool, with clinical application, for scoring listener intelligibility of speech.
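For illustration, the basic operation of automated transcript scoring (counting target words that appear in a listener's typed response after simple normalization) can be sketched as below. This is a simplified stand-in, not the Autoscore algorithm, whose rule set for spelling, tense, and plural variants is more elaborate; the function name and normalization rule are assumptions.

import re

def score_transcript(target, response):
    """Count target words correctly reported in a listener's typed response."""
    norm = lambda s: re.sub(r"[^a-z' ]", " ", s.lower()).split()
    target_words = norm(target)
    response_words = norm(response)
    hits = 0
    for word in target_words:
        if word in response_words:
            hits += 1
            response_words.remove(word)     # each response word credits only one target word
    return hits, len(target_words)

# Example: returns (4, 5), i.e., four of five target words reported.
print(score_transcript("the boy ran to school", "a boy ran to the pool"))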
Collapse
Affiliation(s)
- Stephanie A Borrie
- Department of Communicative Disorders and Deaf Education, Utah State University, Logan, Utah 84322, USA
| | - Tyson S Barrett
- Department of Psychology, Utah State University, Logan, Utah 84322, USA
| | - Sarah E Yoho
- Department of Communicative Disorders and Deaf Education, Utah State University, Logan, Utah 84322, USA
| |
Collapse
|
41
|
RETRACTED ARTICLE: Deep convolutional neural network-based speech enhancement to improve speech intelligibility and quality for hearing-impaired listeners. Med Biol Eng Comput 2018; 57:757. [DOI: 10.1007/s11517-018-1933-x] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/27/2022]
|
42
|
Wang D, Chen J. Supervised Speech Separation Based on Deep Learning: An Overview. IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING 2018; 26:1702-1726. [PMID: 31223631 PMCID: PMC6586438 DOI: 10.1109/taslp.2018.2842159] [Citation(s) in RCA: 132] [Impact Index Per Article: 18.9] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/12/2023]
Abstract
Speech separation is the task of separating target speech from background interference. Traditionally, speech separation is studied as a signal processing problem. A more recent approach formulates speech separation as a supervised learning problem, where the discriminative patterns of speech, speakers, and background noise are learned from training data. Over the past decade, many supervised separation algorithms have been put forward. In particular, the recent introduction of deep learning to supervised speech separation has dramatically accelerated progress and boosted separation performance. This paper provides a comprehensive overview of the research on deep learning based supervised speech separation in the last several years. We first introduce the background of speech separation and the formulation of supervised separation. Then, we discuss three main components of supervised separation: learning machines, training targets, and acoustic features. Much of the overview is on separation algorithms where we review monaural methods, including speech enhancement (speech-nonspeech separation), speaker separation (multitalker separation), and speech dereverberation, as well as multimicrophone techniques. The important issue of generalization, unique to supervised learning, is discussed. This overview provides a historical perspective on how advances are made. In addition, we discuss a number of conceptual issues, including what constitutes the target source.
Collapse
Affiliation(s)
- DeLiang Wang
- Department of Computer Science and Engineering and the Center for Cognitive and Brain Sciences, The Ohio State University, Columbus, OH 43210 USA, and also with the Center of Intelligent Acoustics and Immersive Communications, Northwestern Polytechnical University, Xi'an 710072, China
| | - Jitong Chen
- Department of Computer Science and Engineering, The Ohio State University, Columbus, OH 43210 USA. He is now with Silicon Valley AI Lab, Baidu Research, Sunnyvale, CA 94089 USA
| |
Collapse
|
43
|
Healy EW, Vasko JL. An ideal quantized mask to increase intelligibility and quality of speech in noise. THE JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA 2018; 144:1392. [PMID: 30424638 PMCID: PMC6136922 DOI: 10.1121/1.5053115] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 05/16/2018] [Revised: 08/06/2018] [Accepted: 08/20/2018] [Indexed: 05/25/2023]
Abstract
Time-frequency (T-F) masks represent powerful tools to increase the intelligibility of speech in background noise. Translational relevance is provided by their accurate estimation based only on the signal-plus-noise mixture, using deep learning or other machine-learning techniques. In the current study, a technique is designed to capture the benefits of existing techniques. In the ideal quantized mask (IQM), speech and noise are partitioned into T-F units, and each unit receives one of N attenuations according to its signal-to-noise ratio. It was found that as few as four to eight attenuation steps (IQM4, IQM8) improved intelligibility over the ideal binary mask (IBM, having two attenuation steps), and equaled the intelligibility resulting from the ideal ratio mask (IRM, having a theoretically infinite number of steps). Sound-quality ratings and rankings of noisy speech processed by the IQM4 and IQM8 were also superior to that processed by the IBM and equaled or exceeded that processed by the IRM. It is concluded that the intelligibility and sound-quality advantages of infinite attenuation resolution can be captured by an IQM having only a very small number of steps. Further, the classification-based nature of the IQM might provide algorithmic advantages over the regression-based IRM during machine estimation.
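For illustration, the ideal quantized mask can be written directly from the description above: compute the SNR in each time-frequency unit and assign it one of N attenuation values. The SNR-to-step mapping below (uniform steps between a floor and 0 dB) and the function name are illustrative assumptions; the paper's specific step placements are not reproduced.

import numpy as np

def ideal_quantized_mask(speech_power, noise_power, n_steps=8, floor_db=-40.0):
    """Assign each time-frequency unit one of n_steps attenuations based on its SNR (n_steps >= 2).

    n_steps = 2 reduces to a binary mask; the study found that four to eight steps
    matched the intelligibility of the continuous ideal ratio mask.
    """
    snr_db = 10.0 * np.log10(speech_power / np.maximum(noise_power, 1e-12))
    edges = np.linspace(floor_db, 0.0, n_steps - 1)       # quantization boundaries on local SNR
    step = np.digitize(snr_db, edges)                     # step index 0 .. n_steps-1 per unit
    gains_db = np.linspace(floor_db, 0.0, n_steps)        # attenuation assigned to each step
    return 10.0 ** (gains_db[step] / 20.0)                # linear gain per unit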
Collapse
Affiliation(s)
- Eric W Healy
- Department of Speech and Hearing Science, The Ohio State University, Columbus, Ohio 43210, USA
| | - Jordan L Vasko
- Department of Speech and Hearing Science, The Ohio State University, Columbus, Ohio 43210, USA
| |
Collapse
|
44
|
Zhao Y, Wang D, Johnson EM, Healy EW. A deep learning based segregation algorithm to increase speech intelligibility for hearing-impaired listeners in reverberant-noisy conditions. THE JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA 2018; 144:1627. [PMID: 30424625 PMCID: PMC6167229 DOI: 10.1121/1.5055562] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 06/07/2018] [Revised: 08/27/2018] [Accepted: 09/06/2018] [Indexed: 05/20/2023]
Abstract
Recently, deep learning based speech segregation has been shown to improve human speech intelligibility in noisy environments. However, one important factor not yet considered is room reverberation, which characterizes typical daily environments. The combination of reverberation and background noise can severely degrade speech intelligibility for hearing-impaired (HI) listeners. In the current study, a deep learning based time-frequency masking algorithm was proposed to address both room reverberation and background noise. Specifically, a deep neural network was trained to estimate the ideal ratio mask, where anechoic-clean speech was considered as the desired signal. Intelligibility testing was conducted under reverberant-noisy conditions with reverberation time T60 = 0.6 s, plus speech-shaped noise or babble noise at various signal-to-noise ratios. The experiments demonstrated that substantial speech intelligibility improvements were obtained for HI listeners. The algorithm was also somewhat beneficial for normal-hearing (NH) listeners. In addition, sentence intelligibility scores for HI listeners with algorithm processing approached or matched those of young-adult NH listeners without processing. The current study represents a step toward deploying deep learning algorithms to help the speech understanding of HI listeners in everyday conditions.
Collapse
Affiliation(s)
- Yan Zhao
- Department of Computer Science and Engineering, The Ohio State University, Columbus, Ohio 43210, USA
| | - DeLiang Wang
- Department of Computer Science and Engineering, The Ohio State University, Columbus, Ohio 43210, USA
| | - Eric M Johnson
- Department of Speech and Hearing Science, The Ohio State University, Columbus, Ohio 43210, USA
| | - Eric W Healy
- Department of Speech and Hearing Science, The Ohio State University, Columbus, Ohio 43210, USA
| |
Collapse
|
45
|
Bramsløw L, Naithani G, Hafez A, Barker T, Pontoppidan NH, Virtanen T. Improving competing voices segregation for hearing impaired listeners using a low-latency deep neural network algorithm. THE JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA 2018; 144:172. [PMID: 30075667 DOI: 10.1121/1.5045322] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/08/2023]
Abstract
Hearing aid users are challenged in listening situations with noise, and especially in speech-on-speech situations with two or more competing voices. Specifically, the task of attending to and segregating two competing voices is particularly hard, unlike for normal-hearing listeners, as shown in a small sub-experiment. In the main experiment, the competing-voices benefit of a deep neural network (DNN) based stream segregation enhancement algorithm was tested on hearing-impaired listeners. A mixture of two voices was separated using a DNN, presented to the two ears as individual streams, and tested for word score. Compared to the unseparated mixture, there was a 13-percentage-point benefit from the separation while attending to both voices. If only one output was selected, as in a traditional target-masker scenario, a larger benefit of 37 percentage points was found. The results agreed well with objective metrics and show that, for hearing-impaired listeners, DNNs have a large potential for improving stream segregation and speech intelligibility in difficult scenarios with two equally important targets, without any prior selection of a primary target stream. An even higher benefit can be obtained if the user can select the preferred target via remote control.
Collapse
Affiliation(s)
- Lars Bramsløw
- Eriksholm Research Centre, Oticon A/S, Rørtangvej 20, DK-3070 Snekkersten, Denmark
| | - Gaurav Naithani
- Tampere University of Technology, Laboratory of Signal Processing, Tampere, P.O. Box 553, FI-33101 Tampere, Finland
| | - Atefeh Hafez
- Eriksholm Research Centre, Oticon A/S, Rørtangvej 20, DK-3070 Snekkersten, Denmark
| | - Tom Barker
- Tampere University of Technology, Laboratory of Signal Processing, Tampere, P.O. Box 553, FI-33101 Tampere, Finland
| | | | - Tuomas Virtanen
- Tampere University of Technology, Laboratory of Signal Processing, Tampere, P.O. Box 553, FI-33101 Tampere, Finland
| |
Collapse
|
46
|
Montazeri V, Assmann PF. Constraints on ideal binary masking for the perception of spectrally-reduced speech. THE JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA 2018; 144:EL59. [PMID: 30075663 DOI: 10.1121/1.5046442] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 04/12/2018] [Accepted: 06/26/2018] [Indexed: 06/08/2023]
Abstract
This study investigated recognition of sentences processed using ideal binary masking (IBM) with limited spectral resolution. Local thresholds (LCs) of -12, 0, and 5 dB were applied, which altered the target and masker power following IBM. Recognition was reduced due to persistence of the masker and limited target recovery, preventing the IBM from achieving ideal target-masker segregation. Linear regression and principal component analyses showed that, regardless of masker type and number of spectral channels, higher LCs were associated with poorer recognition. In addition, limitations on target recovery were more detrimental to speech recognition than persistence of the masker.
Collapse
Affiliation(s)
- Vahid Montazeri
- School of Behavioral and Brain Sciences, The University of Texas at Dallas, Richardson, Texas 75080, USA
| | - Peter F Assmann
- School of Behavioral and Brain Sciences, The University of Texas at Dallas, Richardson, Texas 75080, USA
| |
Collapse
|
47
|
Bentsen T, May T, Kressner AA, Dau T. The benefit of combining a deep neural network architecture with ideal ratio mask estimation in computational speech segregation to improve speech intelligibility. PLoS One 2018; 13:e0196924. [PMID: 29763459 PMCID: PMC5953465 DOI: 10.1371/journal.pone.0196924] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/10/2018] [Accepted: 04/23/2018] [Indexed: 11/19/2022] Open
Abstract
Computational speech segregation attempts to automatically separate speech from noise. This is challenging in conditions with interfering talkers and low signal-to-noise ratios. Recent approaches have adopted deep neural networks and successfully demonstrated speech intelligibility improvements. A selection of components may be responsible for the success with these state-of-the-art approaches: the system architecture, a time frame concatenation technique and the learning objective. The aim of this study was to explore the roles and the relative contributions of these components by measuring speech intelligibility in normal-hearing listeners. A substantial improvement of 25.4 percentage points in speech intelligibility scores was found going from a subband-based architecture, in which a Gaussian Mixture Model-based classifier predicts the distributions of speech and noise for each frequency channel, to a state-of-the-art deep neural network-based architecture. Another improvement of 13.9 percentage points was obtained by changing the learning objective from the ideal binary mask, in which individual time-frequency units are labeled as either speech- or noise-dominated, to the ideal ratio mask, where the units are assigned a continuous value between zero and one. Therefore, both components play significant roles and by combining them, speech intelligibility improvements were obtained in a six-talker condition at a low signal-to-noise ratio.
Collapse
Affiliation(s)
- Thomas Bentsen
- Hearing Systems Group, Department of Electrical Engineering, Technical University of Denmark, Kgs. Lyngby, Denmark
| | - Tobias May
- Hearing Systems Group, Department of Electrical Engineering, Technical University of Denmark, Kgs. Lyngby, Denmark
| | - Abigail A. Kressner
- Hearing Systems Group, Department of Electrical Engineering, Technical University of Denmark, Kgs. Lyngby, Denmark
| | - Torsten Dau
- Hearing Systems Group, Department of Electrical Engineering, Technical University of Denmark, Kgs. Lyngby, Denmark
| |
Collapse
|
48
|
Keshavarzi M, Goehring T, Zakis J, Turner RE, Moore BCJ. Use of a Deep Recurrent Neural Network to Reduce Wind Noise: Effects on Judged Speech Intelligibility and Sound Quality. Trends Hear 2018; 22:2331216518770964. [PMID: 29708061 PMCID: PMC5949931 DOI: 10.1177/2331216518770964] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022] Open
Abstract
Despite great advances in hearing-aid technology, users still experience problems with noise in windy environments. The potential benefits of using a deep recurrent neural network (RNN) for reducing wind noise were assessed. The RNN was trained using recordings of the output of the two microphones of a behind-the-ear hearing aid in response to male and female speech at various azimuths in the presence of noise produced by wind from various azimuths with a velocity of 3 m/s, using the “clean” speech as a reference. A paired-comparison procedure was used to compare all possible combinations of three conditions for subjective intelligibility and for sound quality or comfort. The conditions were unprocessed noisy speech, noisy speech processed using the RNN, and noisy speech that was high-pass filtered (which also reduced wind noise). Eighteen native English-speaking participants were tested, nine with normal hearing and nine with mild-to-moderate hearing impairment. Frequency-dependent linear amplification was provided for the latter. Processing using the RNN was significantly preferred over no processing by both subject groups for both subjective intelligibility and sound quality, although the magnitude of the preferences was small. High-pass filtering (HPF) was not significantly preferred over no processing. Although RNN was significantly preferred over HPF only for sound quality for the hearing-impaired participants, for the results as a whole, there was a preference for RNN over HPF. Overall, the results suggest that reduction of wind noise using an RNN is possible and might have beneficial effects when used in hearing aids.
Collapse
Affiliation(s)
| | - Tobias Goehring
- 2 MRC Cognition and Brain Sciences Unit, University of Cambridge, Cambridge, UK
| | - Justin Zakis
- 3 Blamey and Saunders Hearing Pty Ltd, East Melbourne, Victoria, Australia
| | - Richard E Turner
- 4 Department of Engineering, University of Cambridge, Cambridge, UK
| | - Brian C J Moore
- 1 Department of Psychology, University of Cambridge, Cambridge, UK
| |
Collapse
|
49
|
Koning R, Bruce IC, Denys S, Wouters J. Perceptual and Model-Based Evaluation of Ideal Time-Frequency Noise Reduction in Hearing-Impaired Listeners. IEEE Trans Neural Syst Rehabil Eng 2018. [PMID: 29522412 DOI: 10.1109/tnsre.2018.2794557] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/06/2022]
Abstract
State-of-the-art hearing aids (HAs) try to overcome poor speech intelligibility (SI) in noisy listening environments using digital noise reduction (NR) techniques. The application of time-frequency masks to the noisy sound input is a common NR technique for increasing SI. The binary mask, with its binary weights, and the Wiener filter, with continuous weights, are representatives of hard- and soft-decision approaches to time-frequency masking. In normal-hearing listeners, the ideal Wiener filter (IWF) outperforms the ideal binary mask (IBM) in terms of SI and speech quality, yielding perfect SI even at very low signal-to-noise ratios. In this paper, both approaches were investigated for hearing-impaired (HI) listeners, using perceptual and auditory-model-based measures for the evaluation. The IWF outperformed the IBM in terms of SI. In terms of quality, no overall difference between the NR algorithms was perceived. Additionally, the processed signals were evaluated with an auditory nerve model using the neurogram similarity metric (NSIM). The mean NSIM values were significantly different for intelligible and unintelligible sentences. The results suggest that a soft mask is promising for application in HAs.
Collapse
|
50
|
Soleymani R, Selesnick IW, Landsberger DM. SEDA: A tunable Q-factor wavelet-based noise reduction algorithm for multi-talker babble. SPEECH COMMUNICATION 2018; 96:102-115. [PMID: 29606781 PMCID: PMC5875444 DOI: 10.1016/j.specom.2017.11.004] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/08/2023]
Abstract
We introduce a new wavelet-based algorithm to enhance the quality of speech corrupted by multi-talker babble noise. The algorithm comprises three stages: the first stage classifies short frames of the noisy speech as speech-dominated or noise-dominated. We design this classifier specifically for multi-talker babble noise. The second stage performs preliminary denoising of noisy speech frames using oversampled wavelet transforms and parallel group thresholding. The final stage performs further denoising by attenuating residual high-frequency components in the signal produced by the second stage. A significant improvement in intelligibility and quality was observed in evaluation tests of the algorithm with cochlear implant users.
Collapse
Affiliation(s)
- Roozbeh Soleymani
- Department of Electrical and Computer Engineering, Tandon School of Engineering, New York University, 2 Metrotech Center, Brooklyn, NY 11201
- Department of Otolaryngology, New York University School of Medicine, 550 First Avenue, STE NBV 5E5, New York, NY 10016 USA
| | - Ivan W. Selesnick
- Department of Electrical and Computer Engineering, Tandon School of Engineering, New York University, 2 Metrotech Center, Brooklyn, NY 11201
| | - David M. Landsberger
- Department of Otolaryngology, New York University School of Medicine, 550 First Avenue, STE NBV 5E5, New York, NY 10016 USA
| |
Collapse
|