1. Fletcher MD, Perry SW, Thoidis I, Verschuur CA, Goehring T. Improved tactile speech robustness to background noise with a dual-path recurrent neural network noise-reduction method. Sci Rep 2024;14:7357. PMID: 38548750; PMCID: PMC10978864; DOI: 10.1038/s41598-024-57312-7.
Abstract
Many people with hearing loss struggle to understand speech in noisy environments, making noise robustness critical for hearing-assistive devices. Recently developed haptic hearing aids, which convert audio to vibration, can improve speech-in-noise performance for cochlear implant (CI) users and assist those unable to access hearing-assistive devices. They are typically body-worn rather than head-mounted, allowing additional space for batteries and microprocessors, and so can deploy more sophisticated noise-reduction techniques. The current study assessed whether a real-time-feasible dual-path recurrent neural network (DPRNN) can improve tactile speech-in-noise performance. Audio was converted to vibration on the wrist using a vocoder method, either with or without noise reduction. Performance was tested for speech in a multi-talker noise (recorded at a party) with a 2.5-dB signal-to-noise ratio. An objective assessment showed the DPRNN improved the scale-invariant signal-to-distortion ratio by 8.6 dB and substantially outperformed traditional noise-reduction (log-MMSE). A behavioural assessment in 16 participants showed the DPRNN improved tactile-only sentence identification in noise by 8.2%. This suggests that advanced techniques like the DPRNN could substantially improve outcomes with haptic hearing aids. Low-cost haptic devices could soon be an important supplement to hearing-assistive devices such as CIs or offer an alternative for people who cannot access CI technology.
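The headline objective result is an 8.6 dB gain in scale-invariant signal-to-distortion ratio (SI-SDR). A minimal NumPy sketch of the standard SI-SDR definition (an illustration of the metric, not the authors' evaluation code):

```python
import numpy as np

def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio (SI-SDR) in dB.

    The reference is first rescaled by the least-squares optimal gain,
    so the metric is blind to overall level differences.
    """
    estimate = np.asarray(estimate, dtype=float)
    reference = np.asarray(reference, dtype=float)
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference          # scaled reference = "target" component
    distortion = estimate - target      # everything else counts as distortion
    return 10.0 * np.log10((np.dot(target, target) + eps)
                           / (np.dot(distortion, distortion) + eps))
```

An improvement of 8.6 dB then means `si_sdr(enhanced, clean) - si_sdr(noisy, clean)` is 8.6.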
Affiliation(s)
- Mark D Fletcher
- University of Southampton Auditory Implant Service, University of Southampton, University Road, Southampton, SO17 1BJ, UK
- Institute of Sound and Vibration Research, University of Southampton, University Road, Southampton, SO17 1BJ, UK
- Samuel W Perry
- University of Southampton Auditory Implant Service, University of Southampton, University Road, Southampton, SO17 1BJ, UK
- Institute of Sound and Vibration Research, University of Southampton, University Road, Southampton, SO17 1BJ, UK
- Iordanis Thoidis
- School of Electrical and Computer Engineering, Aristotle University of Thessaloniki, 54124, Thessaloniki, Greece
- Carl A Verschuur
- University of Southampton Auditory Implant Service, University of Southampton, University Road, Southampton, SO17 1BJ, UK
- Tobias Goehring
- MRC Cognition and Brain Sciences Unit, University of Cambridge, 15 Chaucer Road, Cambridge, CB2 7EF, UK
2. Song Q, Sun B, Li S. Multimodal Sparse Transformer Network for Audio-Visual Speech Recognition. IEEE Transactions on Neural Networks and Learning Systems 2023;34:10028-10038. PMID: 35412992; DOI: 10.1109/tnnls.2022.3163771.
Abstract
Automatic speech recognition (ASR) is the major human-machine interface in many intelligent systems, such as smart homes, autonomous driving, and service robots. However, its performance usually deteriorates significantly in the presence of external noise, which limits its application scenarios. Audio-visual speech recognition (AVSR) takes visual information as a complementary modality to effectively enhance the performance of audio speech recognition, particularly in noisy conditions. Recently, transformer-based architectures have been used to model the audio and video sequences for AVSR, achieving superior performance. However, these architectures may extract irrelevant information while modeling long-term dependencies, degrading performance. In addition, motion features are essential for capturing the spatio-temporal information within the lip region and thus for making the best use of the visual sequences, but they have not previously been considered in AVSR tasks. Therefore, we propose a multimodal sparse transformer network (MMST) in this article. The sparse self-attention mechanism improves the concentration of attention on global information by selecting only the most relevant parts. Moreover, motion features are seamlessly introduced into the MMST model: motion-modality information flows into the visual modality through a cross-modal attention module to enhance the visual features, further improving recognition performance. Extensive experiments conducted on different datasets validate that our proposed method outperforms several state-of-the-art methods in terms of word error rate (WER).
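The sparse self-attention described in the abstract can be illustrated with a top-k masked variant of scaled dot-product attention. This is a generic sketch of the idea, not the MMST implementation; the function names are ours:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def topk_sparse_attention(Q, K, V, k):
    """Scaled dot-product attention that keeps, per query, only the k
    largest scores and masks out the rest before the softmax."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # (num_queries, num_keys)
    kth = np.sort(scores, axis=-1)[:, -k][:, None]   # k-th largest per row
    masked = np.where(scores >= kth, scores, -np.inf)
    return softmax(masked, axis=-1) @ V
```

With `k` equal to the number of keys this reduces to ordinary dense attention; smaller `k` discards the least relevant key-value pairs per query.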
3. Rascon C. Characterization of Deep Learning-Based Speech-Enhancement Techniques in Online Audio Processing Applications. Sensors (Basel) 2023;23(9):4394. PMID: 37177598; PMCID: PMC10181690; DOI: 10.3390/s23094394.
Abstract
Deep learning-based speech-enhancement techniques have recently been an area of growing interest, since their impressive performance can potentially benefit a wide variety of digital voice communication systems. However, such performance has been evaluated mostly in offline audio-processing scenarios (i.e., feeding the model, in one go, a complete audio recording, which may extend over several seconds). It is of significant interest to evaluate and characterize the current state of the art in applications that process audio online (i.e., feeding the model a sequence of segments of audio data and concatenating the results at the output end). Although evaluations and comparisons between speech-enhancement techniques have been carried out before, as far as the author knows, the work presented here is the first to evaluate the performance of such techniques in relation to their online applicability. Specifically, this work measures how the output signal-to-interference ratio (as a separation metric), the response time, and the memory usage (as online metrics) are impacted by the input length (the size of the audio segments), in addition to the amount of noise, the amount and number of interferences, and the amount of reverberation. Three popular models, MetricGAN+, Spectral Feature Mapping with Mimic Loss, and Demucs-Denoiser, were evaluated, given their availability in public repositories and their online viability. The characterization was carried out using a systematic evaluation protocol based on the SpeechBrain framework. Several intuitions are presented and discussed, and some recommendations for future work are proposed.
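The online protocol described above, feeding a model fixed-size segments and concatenating the outputs while recording the response time, can be sketched as follows. `enhance_segment` is a hypothetical placeholder standing in for any of the evaluated models:

```python
import time
import numpy as np

def enhance_segment(segment):
    """Hypothetical stand-in for a speech-enhancement model; a trivial
    soft noise gate keeps the pipeline runnable."""
    return np.where(np.abs(segment) > 0.01, segment, 0.0)

def run_online(signal, segment_len, enhance=enhance_segment):
    """Feed the signal to the model in consecutive fixed-size segments,
    concatenating the outputs and recording per-segment response times."""
    outputs, times = [], []
    for start in range(0, len(signal), segment_len):
        segment = signal[start:start + segment_len]
        t0 = time.perf_counter()
        outputs.append(enhance(segment))
        times.append(time.perf_counter() - t0)
    return np.concatenate(outputs), times
```

For real-time viability, each entry of `times` must stay below the segment duration (`segment_len / fs`).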
Affiliation(s)
- Caleb Rascon
- Computer Science Department, Instituto de Investigaciones en Matematicas Aplicadas y en Sistemas, Universidad Nacional Autonoma de Mexico, Mexico City 3000, Mexico
4. Multiple time-instances features based approach for reference-free speech quality measurement. Comput Speech Lang 2023. DOI: 10.1016/j.csl.2022.101478.
5. Sokolova A, Sengupta D, Hunt M, Gupta R, Aksanli B, Harris F, Garudadri H. Real-Time Multirate Multiband Amplification for Hearing Aids. IEEE Access 2022;10:54301-54312. PMID: 37309510; PMCID: PMC10260239; DOI: 10.1109/access.2022.3176368.
Abstract
Hearing loss is a common problem affecting the quality of life of thousands of people. However, many individuals with hearing loss are dissatisfied with the quality of modern hearing aids. Amplification is the main method of compensating for hearing loss in modern hearing aids. One common amplification technique is dynamic range compression, which maps audio signals onto a person's hearing range using an amplification curve. However, due to the frequency-dependent nature of the human cochlea, compression is often performed independently in different frequency bands. This paper presents a real-time multirate multiband amplification system for hearing aids, which includes a multirate channelizer for separating an audio signal into eleven standard audiometric frequency bands, and an automatic gain control system for accurate control of the steady-state and dynamic behavior of audio compression as specified by ANSI standards. The spectral channelizer offers high frequency resolution with a low latency of 5.4 ms and about a 14× improvement in complexity over a baseline design. Our automatic gain control includes a closed-form solution for satisfying any designated attack and release times for any desired compression parameters. The increased frequency resolution and precise gain adjustment allow our system to more accurately fulfill audiometric hearing aid prescriptions.
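The attack/release behavior that the automatic gain control must satisfy can be illustrated with a single-band compressor using one-pole level smoothing. This is a textbook sketch, not the paper's closed-form multiband design; all parameter values are illustrative:

```python
import numpy as np

def compress(x, fs, threshold_db=-30.0, ratio=3.0, attack_ms=5.0, release_ms=50.0):
    """Single-band dynamic range compressor with one-pole attack/release
    smoothing of the level estimate."""
    a_att = np.exp(-1.0 / (fs * attack_ms / 1000.0))
    a_rel = np.exp(-1.0 / (fs * release_ms / 1000.0))
    level = 1e-6
    out = np.empty_like(x)
    for n, s in enumerate(x):
        mag = abs(s)
        a = a_att if mag > level else a_rel       # fast attack, slow release
        level = a * level + (1.0 - a) * mag
        level_db = 20.0 * np.log10(level + 1e-12)
        # Above threshold, output rises only 1/ratio dB per input dB
        gain_db = min(0.0, (threshold_db - level_db) * (1.0 - 1.0 / ratio))
        out[n] = s * 10.0 ** (gain_db / 20.0)
    return out
```

The attack and release time constants set how quickly the gain reacts to level increases and decreases, which is exactly the dynamic behavior that ANSI compression tests measure.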
Affiliation(s)
- Alice Sokolova
- Department of Electrical and Computer Engineering, UC San Diego, La Jolla, CA 92093, USA
- Department of Electrical and Computer Engineering, San Diego State University, San Diego, CA 92182, USA
- Dhiman Sengupta
- Department of Computer Science and Engineering, UC San Diego, La Jolla, CA 92093, USA
- Martin Hunt
- Department of Electrical and Computer Engineering, UC San Diego, La Jolla, CA 92093, USA
- Rajesh Gupta
- Department of Computer Science and Engineering, UC San Diego, La Jolla, CA 92093, USA
- Halıcıoğlu Data Science Institute, La Jolla, CA 92093, USA
- Baris Aksanli
- Department of Electrical and Computer Engineering, San Diego State University, San Diego, CA 92182, USA
- Fredric Harris
- Department of Electrical and Computer Engineering, UC San Diego, La Jolla, CA 92093, USA
6. Speech Enhancement Framework with Noise Suppression Using Block Principal Component Analysis. Acoustics 2022. DOI: 10.3390/acoustics4020027.
Abstract
With the advancement in voice-communication-based human–machine interface technology in smart home devices, the ability to decompose the received speech signal into a signal of interest and an interference component has emerged as a key requirement for their successful operation. These devices perform their tasks in real time based on the received commands, and their effectiveness is limited when there is a lot of ambient noise in the area in which they operate. Most real-time speech enhancement algorithms do not perform adequately in the presence of high amounts of noise (i.e., at low input signal-to-noise ratios). In this manuscript, we propose a speech enhancement framework to help these algorithms in situations where the noise level in the received signal is high. The proposed framework performs noise suppression in the frequency domain by generating an approximation of the noisy signal's short-time Fourier transform, which is then used by the speech enhancement algorithms to recover the underlying clean signal. This approximation is performed using the proposed block principal component analysis (Block-PCA) algorithm. To illustrate the efficacy of the proposed framework, we present a detailed performance evaluation under different noise levels and noise types, highlighting the effectiveness of the proposed framework. Moreover, the proposed method can be used in conjunction with any speech enhancement algorithm to improve its performance under moderate- to high-noise scenarios.
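The Block-PCA idea, approximating the noisy spectrogram by keeping only its dominant principal components block by block, can be sketched with a truncated SVD (equivalent to PCA up to mean-centering). This is a generic low-rank illustration under our own assumptions, not the authors' exact algorithm:

```python
import numpy as np

def lowrank_approx(S, rank):
    """Best rank-r approximation of a matrix via truncated SVD."""
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank]

def block_pca_denoise(S, block=32, rank=4):
    """Approximate a magnitude spectrogram block-by-block along time,
    keeping only the dominant components in each block; low-variance
    (noise-like) structure is discarded."""
    out = np.zeros_like(S)
    for start in range(0, S.shape[1], block):
        seg = S[:, start:start + block]
        r = min(rank, seg.shape[0], seg.shape[1])
        out[:, start:start + seg.shape[1]] = lowrank_approx(seg, r)
    return out
```

Because speech energy concentrates in a few dominant spectro-temporal components while broadband noise spreads across all of them, the truncation suppresses noise more than speech.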
7. Kumar KP, Kanhe A. Secured Speech Watermarking with DCT Compression and Chaotic Embedding Using DWT and SVD. Arabian Journal for Science and Engineering 2022. DOI: 10.1007/s13369-021-06431-8.
8. Feng Y, Chen F. Nonintrusive objective measurement of speech intelligibility: A review of methodology. Biomed Signal Process Control 2022. DOI: 10.1016/j.bspc.2021.103204.
9. Sokolova A, Sengupta D, Chen KL, Gupta R, Aksanli B, Harris F, Garudadri H. Multirate Audiometric Filter Bank for Hearing Aid Devices. Conference Record, Asilomar Conference on Signals, Systems & Computers 2021;2021:1436-1442. PMID: 35368329; PMCID: PMC8973212; DOI: 10.1109/ieeeconf53345.2021.9723257.
Abstract
The frequency-dependent nature of hearing loss poses many challenges for hearing aid design. In order to compensate for a hearing aid user's unique hearing loss pattern, an input signal often needs to be separated into frequency bands, or channels, through a process called sub-band decomposition. In this paper, we present a real-time filter bank for hearing aids. Our filter bank features 10 channels uniformly distributed on the logarithmic scale, located at the standard audiometric frequencies used for the characterization and fitting of hearing aids. We obtained filters with very narrow passbands in the lower frequencies by employing multi-rate signal processing. Our filter bank offers a 9.1× reduction in complexity as compared to conventional signal processing. We implemented our filter bank on Open Speech Platform, an open-source hearing aid, and confirmed real-time operation.
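The multirate principle behind this design (filter, then halve the sample rate so lower bands are processed at progressively lower rates) can be sketched as a recursive half-band split. This is a crude illustration of where the complexity savings come from, not the paper's audiometric filter bank:

```python
import numpy as np

def lowpass_fir(cutoff, numtaps=63):
    """Windowed-sinc lowpass FIR; cutoff normalized to Nyquist (0..1)."""
    n = np.arange(numtaps) - (numtaps - 1) / 2
    h = cutoff * np.sinc(cutoff * n) * np.hamming(numtaps)
    return h / h.sum()

def multirate_octave_bands(x, stages=4):
    """Split a signal into octave-spaced bands, halving the sample rate
    after each stage so lower bands are filtered at ever lower rates."""
    h = lowpass_fir(0.5)                 # half-band lowpass, reused every stage
    bands = []
    for _ in range(stages):
        low = np.convolve(x, h, mode="same")
        bands.append(x - low)            # top octave at the current rate
        x = low[::2]                     # decimate by 2 for the next stage
    bands.append(x)                      # residual lowest band
    return bands
```

Each successive band is computed on half as many samples as the previous one, which is the source of the complexity reduction relative to filtering every band at the full rate.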
Affiliation(s)
- Alice Sokolova
- Department of Electrical and Computer Engineering, UC San Diego, La Jolla, CA, USA
- Department of Electrical and Computer Engineering, San Diego State University, San Diego, CA, USA
- Dhiman Sengupta
- Department of Computer Science and Engineering, UC San Diego, La Jolla, CA, USA
- Kuan-Lin Chen
- Department of Electrical and Computer Engineering, UC San Diego, La Jolla, CA, USA
- Rajesh Gupta
- Department of Computer Science and Engineering, UC San Diego, La Jolla, CA, USA
- Baris Aksanli
- Department of Electrical and Computer Engineering, San Diego State University, San Diego, CA, USA
- Fredric Harris
- Department of Electrical and Computer Engineering, UC San Diego, La Jolla, CA, USA
- Harinath Garudadri
- Department of Electrical and Computer Engineering, UC San Diego, La Jolla, CA, USA
10. Kumar S, Singh S, Agarwal P, Acharya UK, Sethy PK, Pandey C. Speech quality evaluation for different pitch detection algorithms in LPC speech analysis–synthesis system. International Journal of Speech Technology 2021;24:545-551. DOI: 10.1007/s10772-020-09765-0.
11. De-Noising Process in Room Impulse Response with Generalized Spectral Subtraction. Applied Sciences (Basel) 2021. DOI: 10.3390/app11156858.
Abstract
The generalized spectral subtraction (GBSS) algorithm, which has an extraordinary ability to reduce background noise, is historically one of the first approaches used for speech enhancement and dereverberation. However, the algorithm had not previously been applied to de-noise the room impulse response (RIR) to extend the reverberation decay range. The application of the GBSS algorithm in this study is stated as an optimization problem: subtracting the noise level from the RIR while maintaining the signal quality. The optimization process, conducted on measurements of RIRs with artificial noise and natural ambient noise, aims to determine the optimal sets of factors that achieve the best noise-reduction results, in the sense of the largest dynamic range improvement. The optimal factors are set variables determined by the estimated SNRs of the RIRs filtered in octave bands. The acoustic parameters reverberation time (RT) and early decay time (EDT), together with the dynamic range improvement of the energy decay curve, were used as control measures and evaluation criteria to ensure the reliability of the algorithm. The de-noising results were compared with noise compensation methods. With the achieved optimal factors, the GBSS contributes a significant effect in terms of dynamic range improvement and decreases the estimation errors in the RTs caused by noise.
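For reference, the over-subtraction form of spectral subtraction that this work builds on can be written in a few lines; `alpha` is the over-subtraction factor and `beta` the spectral floor (a Berouti-style sketch with illustrative parameter values, not the paper's optimized factors):

```python
import numpy as np

def spectral_subtract(mag, noise_mag, alpha=2.0, beta=0.01):
    """Power spectral subtraction with over-subtraction factor alpha and
    spectral floor beta; returns the de-noised magnitude spectrum."""
    power = mag ** 2 - alpha * noise_mag ** 2
    floor = beta * noise_mag ** 2
    return np.sqrt(np.maximum(power, floor))
```

The optimization the abstract describes amounts to choosing `alpha` and `beta` per octave band, as a function of the estimated SNR, to maximize the dynamic range of the resulting decay curve.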
12. Podusenko A, Kouw WM, de Vries B. Message Passing-Based Inference for Time-Varying Autoregressive Models. Entropy 2021;23(6):683. PMID: 34071643; PMCID: PMC8227039; DOI: 10.3390/e23060683.
Abstract
Time-varying autoregressive (TVAR) models are widely used for modeling of non-stationary signals. Unfortunately, online joint adaptation of both states and parameters in these models remains a challenge. In this paper, we represent the TVAR model by a factor graph and solve the inference problem by automated message passing-based inference for states and parameters. We derive structured variational update rules for a composite “AR node” with probabilistic observations that can be used as a plug-in module in hierarchical models, for example, to model the time-varying behavior of the hyper-parameters of a time-varying AR model. Our method includes tracking of variational free energy (FE) as a Bayesian measure of TVAR model performance. The proposed methods are verified on a synthetic data set and validated on real-world data from temperature modeling and speech enhancement tasks.
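As a simpler point of comparison for the joint state/parameter adaptation problem, time-varying AR coefficients can be tracked online with recursive least squares and a forgetting factor. This is not the paper's variational message-passing method, only an illustration of the tracking task it addresses:

```python
import numpy as np

def track_tvar(x, order=2, lam=0.99, delta=100.0):
    """Track (time-varying) AR coefficients with recursive least squares;
    the forgetting factor lam < 1 lets old samples fade so the estimate
    can follow drifting coefficients."""
    theta = np.zeros(order)
    P = delta * np.eye(order)
    history = []
    for n in range(order, len(x)):
        phi = x[n - order:n][::-1]             # [x[n-1], ..., x[n-order]]
        k = P @ phi / (lam + phi @ P @ phi)
        theta = theta + k * (x[n] - phi @ theta)
        P = (P - np.outer(k, phi @ P)) / lam
        history.append(theta.copy())
    return np.array(history)
```

Unlike this point estimator, the message-passing approach in the paper also yields posterior uncertainty and a free-energy score for model comparison.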
Affiliation(s)
- Albert Podusenko
- Department of Electrical Engineering, Eindhoven University of Technology, 5612 AZ Eindhoven, The Netherlands
- Wouter M. Kouw
- Department of Electrical Engineering, Eindhoven University of Technology, 5612 AZ Eindhoven, The Netherlands
- Bert de Vries
- Department of Electrical Engineering, Eindhoven University of Technology, 5612 AZ Eindhoven, The Netherlands
- GN Hearing, JF Kennedylaan 2, 5612 AB Eindhoven, The Netherlands
13. Shang W, Stevenson M. Detection of speech playback attacks using robust harmonic trajectories. Comput Speech Lang 2021. DOI: 10.1016/j.csl.2020.101133.
14. An Experimental Analysis of Deep Learning Architectures for Supervised Speech Enhancement. Electronics 2020. DOI: 10.3390/electronics10010017.
Abstract
Recent speech enhancement research has shown that deep learning techniques are very effective in removing background noise. Many deep neural networks have been proposed, showing promising results for improving overall speech perception. The deep multilayer perceptron, convolutional neural networks, and the denoising autoencoder are well-established architectures for speech enhancement; however, choosing between different deep learning models has been mainly empirical. Consequently, a comparative analysis is needed between these three architecture types to show the factors affecting their performance. In this paper, this analysis is presented by comparing seven deep learning models that belong to these three categories. The comparison includes evaluating the performance in terms of the overall quality of the output speech using five objective evaluation metrics and a subjective evaluation with 23 listeners; the ability to deal with challenging noise conditions; generalization ability; complexity; and processing time. Further analysis is then provided using two different approaches. The first approach investigates how the performance is affected by changing network hyperparameters and the structure of the data, including the Lombard effect. The second approach interprets the results by visualizing the spectrogram of the output layer of all the investigated models, and the spectrograms of the hidden layers of the convolutional neural network architecture. Finally, a general evaluation of supervised deep learning-based speech enhancement is performed using a SWOC analysis to discuss the technique's Strengths, Weaknesses, Opportunities, and Challenges. The results of this paper contribute to the understanding of how different deep neural networks perform the speech enhancement task, highlight the strengths and weaknesses of each architecture, and provide recommendations for achieving better performance. This work facilitates the development of better deep neural networks for speech enhancement in the future.
15. Jaisinghani P, Manjula P. Acoustical and Perceptual Analysis of Noise Reduction Strategies in Individuals With Auditory Neuropathy Spectrum Disorders. Journal of Speech, Language, and Hearing Research 2020;63:4208-4218. PMID: 33175645; DOI: 10.1044/2020_jslhr-20-00176.
Abstract
Purpose: Conventional amplification devices provide minimal or no benefit in abating the speech perception problems of individuals with auditory neuropathy spectrum disorder (ANSD). This study was undertaken to evaluate the effect of noise reduction strategies (multiband spectral subtraction, Wiener-as, Karhunen-Loeve transform [Subspace], and the ideal binary mask [IdBM] algorithm) on speech, using speech perception measures and an acoustic measure, among individuals with ANSD. Method: Two groups of participants (age: 17-43 years) were recruited. Group I comprised 12 individuals with a confirmed diagnosis of ANSD and no more than a moderate degree of hearing loss, and Group II comprised 10 individuals with normal hearing in both ears. The signal-to-noise ratio required for 50% speech recognition (SNR-50) was measured for the participants in five conditions: unprocessed speech and speech processed with each of the four noise reduction strategies. Additionally, an acoustic objective measure, the Extended Short-Time Objective Intelligibility algorithm, was employed to estimate the intelligibility index across the conditions. Results: A significant difference was found across conditions in both groups. Pairwise comparison revealed significantly better speech perception on the SNR-50 measure with the IdBM strategy for both groups. No significant difference in SNR-50 was observed with the other noise reduction strategies. The IdBM condition also gave the highest intelligibility index (d) values using the Extended Short-Time Objective Intelligibility algorithm. This finding needs to be verified in a larger group of individuals with ANSD. Conclusions: The IdBM noise reduction strategy yielded a significantly lower SNR-50 than the other noise reduction strategies for the individuals with ANSD in this study. This provides clinical evidence for the strategy and recommends trialling it in a larger group of participants before its implementation in hearing devices. Apart from this, the current strategies used in hearing aids provide no improvement in speech identification in noise for this population. Hence, although present hearing aids may show benefit in quiet conditions, the chances of their rejection are high in noisy backgrounds.
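The ideal binary mask (IdBM) that performed best in this study has a compact definition: retain a time-frequency cell when its local SNR exceeds a local criterion, otherwise zero it. A minimal sketch (the mask is "ideal" because it assumes oracle access to the separate speech and noise power spectrograms):

```python
import numpy as np

def ideal_binary_mask(speech_power, noise_power, lc_db=0.0):
    """Ideal binary mask: keep a time-frequency cell when its local SNR
    exceeds the local criterion lc_db, else zero it."""
    local_snr_db = 10.0 * np.log10((speech_power + 1e-12) / (noise_power + 1e-12))
    return (local_snr_db > lc_db).astype(float)
```

In practice the mask is applied by multiplying it with the noisy spectrogram before resynthesis; real systems must estimate it, since the clean speech and noise are not separately available.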
Affiliation(s)
- Priyanka Jaisinghani
- Department of Audiology, All India Institute of Speech and Hearing, Manasagangothri, Mysore, Karnataka
- P Manjula
- Department of Audiology, All India Institute of Speech and Hearing, Manasagangothri, Mysore, Karnataka
16. Wang J, Yan L, Tian J, Yuan M. Speech enhancement algorithm of improved OMLSA based on bilateral spectrogram filtering. Journal of Intelligent & Fuzzy Systems 2020. DOI: 10.3233/jifs-192088.
Abstract
In this paper, a bilateral spectrogram filtering (BSF)-based optimally modified log-spectral amplitude (OMLSA) estimator for single-channel speech enhancement is proposed. It can significantly improve the performance of OMLSA, especially in highly non-stationary noise environments, by taking advantage of bilateral filtering (BF), a technology widely used in image and visual processing, to preprocess the spectrogram of the noisy speech. When a speech spectrogram is treated as an image, BSF is capable not only of sharpening details and removing unwanted textures or background noise, but also of preserving edges. The a posteriori signal-to-noise ratio (SNR) of the OMLSA algorithm is estimated after applying BSF to the noisy speech. In addition, to reduce computing costs, a fast and accurate BF is adopted that reduces the algorithm complexity to O(1) per time-frequency bin. Finally, the proposed algorithm is compared with the original OMLSA and other classic denoising methods using various types of noise at different signal-to-noise ratios, in terms of objective evaluation metrics such as segmental signal-to-noise ratio improvement and the perceptual evaluation of speech quality. The results show the validity of the improved BSF-based OMLSA algorithm.
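The bilateral-filter preprocessing can be illustrated with a brute-force implementation on a spectrogram treated as an image: weights decay with both spatial distance and magnitude difference, so noise is smoothed while spectral edges survive. A small, unoptimized sketch (the paper uses a fast O(1)-per-bin variant; parameter values here are illustrative):

```python
import numpy as np

def bilateral_filter(S, radius=2, sigma_s=1.5, sigma_r=0.2):
    """Brute-force bilateral filter over a spectrogram: each bin becomes a
    Gaussian-weighted average of its neighbours, with weights falling off
    in both spatial distance and magnitude difference."""
    F, T = S.shape
    out = np.empty_like(S)
    ax = np.arange(-radius, radius + 1)
    dy, dx = np.meshgrid(ax, ax, indexing="ij")
    spatial = np.exp(-(dy**2 + dx**2) / (2 * sigma_s**2))
    padded = np.pad(S, radius, mode="edge")
    for i in range(F):
        for j in range(T):
            patch = padded[i:i + 2 * radius + 1, j:j + 2 * radius + 1]
            range_w = np.exp(-((patch - S[i, j])**2) / (2 * sigma_r**2))
            w = spatial * range_w
            out[i, j] = (w * patch).sum() / w.sum()
    return out
```

The range term is what distinguishes this from a plain Gaussian blur: neighbours whose magnitudes differ strongly from the center bin get negligible weight, so sharp spectral transitions are not smeared.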
Affiliation(s)
- Jie Wang
- Huangpu Research Institute, Guangzhou University, Guangzhou, China
- School of Electronics and Communication Engineering, Guangzhou University, Guangzhou, China
- Linhuang Yan
- School of Electronics and Communication Engineering, Guangzhou University, Guangzhou, China
- Linköping University - Guangzhou University Research Center on Urban Sustainable Development, Guangzhou University, Guangzhou, China
- Jiayi Tian
- School of Electronics and Communication Engineering, Guangzhou University, Guangzhou, China
- Linköping University - Guangzhou University Research Center on Urban Sustainable Development, Guangzhou University, Guangzhou, China
- Minmin Yuan
- Research Institute of Highway Ministry of Transport, Beijing, China
17. Speech Enhancement for Hearing Aids with Deep Learning on Environmental Noises. Applied Sciences (Basel) 2020. DOI: 10.3390/app10176077.
Abstract
Hearing aids are small electronic devices designed to improve hearing for persons with impaired hearing, using sophisticated audio signal processing algorithms and technologies. In general, the speech enhancement algorithms in hearing aids remove environmental noise and enhance speech while taking into account the user's hearing characteristics and the environmental surroundings. In this study, a speech enhancement algorithm was proposed to improve speech quality in a hearing aid environment by applying noise reduction algorithms with deep neural network learning based on noise classification. In order to evaluate the speech enhancement in an actual hearing aid environment, ten types of noise were self-recorded and classified using convolutional neural networks. In addition, noise reduction for speech enhancement in the hearing aid was applied by deep neural networks based on the noise classification. As a result, the speech quality achieved by the deep-neural-network noise reduction with environmental noise classification exhibited a significant improvement over that of the conventional hearing aid algorithm. The improved speech quality was also evaluated by objective measures: the perceptual evaluation of speech quality score, the short-time objective intelligibility score, the overall quality composite measure, and the log likelihood ratio score.
18. Luo C, Pan C, Zheng D, Chen F. Cortical Characterization of Reverberation Time in Reverberant Speech. Annual International Conference of the IEEE Engineering in Medicine and Biology Society 2020;2020:3314-3317. PMID: 33018713; DOI: 10.1109/embc44109.2020.9175977.
Abstract
Reverberation reduces speech quality and therefore causes inconvenience to listeners, especially those using assistive hearing devices. To enhance the quality of reverberant speech, a significant step is speech quality assessment, most of which is based on subjective judgements. Subjective evaluations vary with listeners' perception and emotional and mental states. To obtain an objective assessment of speech quality in reverberation, this work carried out an event-related potential (ERP) study using a passive oddball paradigm. Listeners were presented with anechoic speech as the standard stimulus, mixed with reverberant speech under different levels of reverberation as deviant stimuli. The ERP responses reveal how listeners' subconscious processing interacts with different levels of reverberation in the perceived speech. Results showed that the peak amplitude of P300 in the ERP responses followed the variation of reverberation time in the reverberant speech, providing evidence that the P300 response could serve as a neural surrogate for reverberation time in objective speech quality assessment.
19. Jassim WA, Harte N. Estimation of a priori signal-to-noise ratio using neurograms for speech enhancement. The Journal of the Acoustical Society of America 2020;147:3830. PMID: 32611151; DOI: 10.1121/10.0001324.
Abstract
In statistical-based speech enhancement algorithms, the a priori signal-to-noise ratio (SNR) must be estimated to calculate the required spectral gain function. This paper proposes a method to improve this estimation using features derived from the neural responses of the auditory-nerve (AN) system. The neural responses, interpreted as a neurogram (NG), are simulated for noisy speech using a computational model of the AN system with a range of characteristic frequencies (CFs). Two machine learning algorithms were explored to train the estimation model based on NG features: support vector regression and a convolutional neural network. The proposed estimator was placed in a common speech enhancement system, and three conventional spectral gain functions were employed to estimate the enhanced signal. The proposed method was tested using the NOIZEUS database at different SNR levels, and various speech quality and intelligibility measures were employed for performance evaluation. The a priori SNR estimated from NG features achieved better quality and intelligibility scores than that of recent estimators, especially for highly distorted speech and low SNR values.
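The classic baseline this paper aims to improve on is the decision-directed a priori SNR estimate, which then drives a spectral gain function. A standard sketch (an Ephraim-Malah-style recursion followed by a Wiener gain; parameter values illustrative, and not the authors' neurogram-based estimator):

```python
import numpy as np

def decision_directed(noisy_power, noise_power, alpha=0.98):
    """Decision-directed a priori SNR estimate (xi) per frame, followed
    by a Wiener spectral gain xi / (1 + xi)."""
    num_frames, num_bins = noisy_power.shape
    gains = np.empty_like(noisy_power)
    prev_clean = np.zeros(num_bins)
    for t in range(num_frames):
        gamma = noisy_power[t] / (noise_power + 1e-12)      # a posteriori SNR
        xi = alpha * prev_clean / (noise_power + 1e-12) \
             + (1.0 - alpha) * np.maximum(gamma - 1.0, 0.0)
        gains[t] = xi / (1.0 + xi)
        prev_clean = gains[t] ** 2 * noisy_power[t]         # |estimated clean|^2
    return gains
```

The paper's contribution is to replace this recursion's xi estimate with one predicted from auditory-nerve neurogram features, while keeping the surrounding gain-function machinery the same.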
Affiliation(s)
- Wissam A Jassim
- Sigmedia Group, ADAPT Centre, School of Engineering, Trinity College Dublin, Ireland
- Naomi Harte
- Sigmedia Group, ADAPT Centre, School of Engineering, Trinity College Dublin, Ireland
|
20
|
H Y V, M A A. Improving speech recognition using bionic wavelet features. AIMS ELECTRONICS AND ELECTRICAL ENGINEERING 2020. [DOI: 10.3934/electreng.2020.2.200] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022] Open
|
21
|
Design of MELPe-Based Variable-Bit-Rate Speech Coding with Mel Scale Approach Using Low-Order Linear Prediction Filter and Representing Excitation Signal Using Glottal Closure Instants. ARABIAN JOURNAL FOR SCIENCE AND ENGINEERING 2019. [DOI: 10.1007/s13369-019-04273-z] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/30/2022]
|
22
|
Jassim WA, Zilany MS. NSQM: A non-intrusive assessment of speech quality using normalized energies of the neurogram. COMPUT SPEECH LANG 2019. [DOI: 10.1016/j.csl.2019.04.005] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/17/2022]
|
23
|
Soleymani R, Selesnick IW, Landsberger DM. ALTIS: A new algorithm for adaptive long-term SNR estimation in multi-talker babble. COMPUT SPEECH LANG 2019; 58:231-246. [DOI: 10.1016/j.csl.2019.05.001] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/01/2022]
|
24
|
Kanhe A, Gnanasekaran A. A QIM-Based Energy Modulation Scheme for Audio Watermarking Robust to Synchronization Attack. ARABIAN JOURNAL FOR SCIENCE AND ENGINEERING 2019. [DOI: 10.1007/s13369-018-3540-4] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/01/2022]
|
25
|
Keshavarzi M, Goehring T, Turner RE, Moore BCJ. Comparison of effects on subjective intelligibility and quality of speech in babble for two algorithms: A deep recurrent neural network and spectral subtraction. THE JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA 2019; 145:1493. [PMID: 31067946 DOI: 10.1121/1.5094765] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 11/15/2018] [Accepted: 03/01/2019] [Indexed: 06/09/2023]
Abstract
The effects on speech intelligibility and sound quality of two noise-reduction algorithms were compared: a deep recurrent neural network (RNN) and spectral subtraction (SS). The RNN was trained using sentences spoken by a large number of talkers with a variety of accents, presented in babble. Different talkers were used for testing. Participants with mild-to-moderate hearing loss were tested. Stimuli were given frequency-dependent linear amplification to compensate for the individual hearing losses. A paired-comparison procedure was used to compare all possible combinations of three conditions. The conditions were: speech in babble with no processing (NP) or processed using the RNN or SS. In each trial, the same sentence was played twice using two different conditions. The participants indicated which one was better and by how much in terms of speech intelligibility and (in separate blocks) sound quality. Processing using the RNN was significantly preferred over NP and over SS processing for both subjective intelligibility and sound quality, although the magnitude of the preferences was small. SS processing was not significantly preferred over NP for either subjective intelligibility or sound quality. Objective computational measures of speech intelligibility predicted better intelligibility for RNN than for SS or NP.
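Of the two algorithms compared, spectral subtraction is simple enough to sketch. The following is a minimal power-spectral-subtraction example, an illustration of the general technique rather than the exact SS configuration used in the study; the flooring parameter is an assumption.

```python
import numpy as np

def spectral_subtraction(noisy_power, noise_power, floor=0.01):
    """Subtract a noise power estimate from the noisy power spectrum,
    flooring the result at a fraction of the noisy power to avoid
    negative power estimates (a common musical-noise mitigation)."""
    residual = noisy_power - noise_power
    return np.maximum(residual, floor * noisy_power)

def ss_gain(noisy_power, noise_power, floor=0.01):
    """Magnitude-domain gain implied by the subtracted power spectrum."""
    s_hat = spectral_subtraction(noisy_power, noise_power, floor)
    return np.sqrt(s_hat / noisy_power)
```

For a bin with noisy power 4 and noise estimate 1, the subtracted power is 3 and the applied gain is sqrt(3/4) ≈ 0.87; where the noise estimate exceeds the noisy power, the floor takes over instead of producing a negative value.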
Affiliation(s)
- Mahmoud Keshavarzi
- Department of Psychology, University of Cambridge, Cambridge, United Kingdom
- Tobias Goehring
- MRC Cognition and Brain Sciences Unit, University of Cambridge, Cambridge, United Kingdom
- Richard E Turner
- Department of Engineering, University of Cambridge, Cambridge, United Kingdom
- Brian C J Moore
- Department of Psychology, University of Cambridge, Cambridge, United Kingdom
|
26
|
Bouserhal RE, Bernier A, Voix J. An in-ear speech database in varying conditions of the audio-phonation loop. THE JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA 2019; 145:1069. [PMID: 30823824 DOI: 10.1121/1.5091777] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Received: 08/07/2018] [Accepted: 02/05/2019] [Indexed: 06/09/2023]
Abstract
With the rise of hearables and the advantages of using in-ear microphones with intra-aural devices, accessibility to an in-ear speech database in adverse conditions is essential. Speech captured inside the occluded ear is limited in its frequency bandwidth and has an amplified low frequency content. In addition, occluding the ear canal affects speech production, especially in noisy environments. These changes to speech production have a detrimental effect on speech-based algorithms. Yet, to the authors' knowledge, there are no speech databases that account for these changes. This paper presents a speech-in-ear database of speech captured inside an occluded ear, in noise and in quiet. The database is bilingual (French and English) and is intended to aid researchers in developing algorithms for intra-aural devices utilizing in-ear microphones.
Affiliation(s)
- Rachel E Bouserhal
- École de technologie supérieure, 1100 Rue Notre-Dame O, Montréal, Québec, Canada
- Antoine Bernier
- École de technologie supérieure, 1100 Rue Notre-Dame O, Montréal, Québec, Canada
- Jérémie Voix
- École de technologie supérieure, 1100 Rue Notre-Dame O, Montréal, Québec, Canada
|
27
|
Improvements in Spoken Query System to Access the Agricultural Commodity Prices and Weather Information in Kannada Language/Dialects. JOURNAL OF INTELLIGENT SYSTEMS 2018. [DOI: 10.1515/jisys-2018-0120] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/15/2022] Open
Abstract
This paper demonstrates improvements to a recently developed end-to-end spoken query system for accessing agricultural commodity prices and weather information in the Kannada language and its dialects. The spoken query system consists of an interactive voice response system (IVRS) call flow, automatic speech recognition (ASR) models, and databases of agricultural commodity prices and weather information. The task-specific speech data used in the earlier system contained high levels of background and other noise, as it was collected from farmers of Karnataka state (a state in India where Kannada is spoken) under uncontrolled conditions. These noises adversely affected both on-line and off-line recognition performance. To improve recognition accuracy, a noise elimination algorithm is proposed in this work that combines spectral subtraction with voice activity detection (SS-VAD) and a minimum mean square error spectrum power estimator based on zero crossing (MMSE-SPZC). The noise elimination algorithm is applied before feature extraction. In addition, alternative acoustic models are developed using subspace Gaussian mixture models (SGMM) and deep neural networks (DNN). The experimental results show that these modelling techniques outperform the conventional Gaussian mixture model-hidden Markov model (GMM-HMM) approach used to build the ASR models of the earlier spoken query system. The fusion of the noise elimination technique and SGMM/DNN-based modelling yields a relative accuracy improvement of 7% over the earlier GMM-HMM-based ASR system. The acoustic models with the lowest word error rate (WER) can be used in the spoken query system.
On-line speech recognition accuracy of the developed spoken query system was also tested, with the help of Karnataka farmers, and is presented in this work.
|
28
|
Koning R, Bruce IC, Denys S, Wouters J. Perceptual and Model-Based Evaluation of Ideal Time-Frequency Noise Reduction in Hearing-Impaired Listeners. IEEE Trans Neural Syst Rehabil Eng 2018. [PMID: 29522412 DOI: 10.1109/tnsre.2018.2794557] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/06/2022]
Abstract
State-of-the-art hearing aids (HAs) try to overcome poor speech intelligibility (SI) in noisy listening environments using digital noise reduction (NR) techniques. Applying time-frequency masks to the noisy sound input is a common NR technique for increasing SI. The binary mask, with its binary weights, and the Wiener filter, with continuous weights, are representatives of hard- and soft-decision approaches to time-frequency masking. In normal-hearing listeners, the ideal Wiener filter (IWF) outperforms the ideal binary mask (IBM) in terms of SI and speech quality, with perfect SI even at very low signal-to-noise ratios. In this paper, both approaches were investigated for hearing-impaired (HI) listeners. Perceptual and auditory model-based measures were used for the evaluation. The IWF outperformed the IBM in terms of SI. In terms of quality, no overall difference between the NR algorithms was perceived. Additionally, the processed signals were evaluated based on an auditory nerve model using the neurogram similarity metric (NSIM). The mean NSIM values were significantly different for intelligible and unintelligible sentences. The results suggest that a soft mask is promising for application in HAs.
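The two ideal masks compared in this study have compact standard definitions. A minimal sketch computing both from oracle speech and noise power spectrograms follows; the 0-dB local-SNR criterion for the binary mask is an assumption for illustration, not necessarily the threshold used in the paper.

```python
import numpy as np

def ideal_binary_mask(speech_pow, noise_pow, lc_db=0.0):
    """Hard decision: 1 where the local SNR exceeds the criterion
    lc_db (in dB), 0 elsewhere."""
    snr_db = 10.0 * np.log10(speech_pow / noise_pow)
    return (snr_db > lc_db).astype(float)

def ideal_wiener_filter(speech_pow, noise_pow):
    """Soft decision: continuous gain in (0, 1) computed from the
    oracle speech and noise powers in each time-frequency cell."""
    return speech_pow / (speech_pow + noise_pow)
```

Either mask is applied by multiplying it with the noisy spectrogram; the IBM keeps or discards each cell outright, while the IWF attenuates each cell in proportion to its local SNR.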
|
29
|
Estimation of glottal closure instants from degraded speech using a phase-difference-based algorithm. COMPUT SPEECH LANG 2017. [DOI: 10.1016/j.csl.2017.05.008] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/17/2022]
|
30
|
Kortlang S, Chen Z, Gerkmann T, Kollmeier B, Hohmann V, Ewert SD. Evaluation of combined dynamic compression and single channel noise reduction for hearing aid applications. Int J Audiol 2017; 57:S43-S54. [PMID: 28355947 DOI: 10.1080/14992027.2017.1300695] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/19/2022]
Abstract
OBJECTIVE Single-channel noise reduction (SCNR) and dynamic range compression (DRC) are important elements in hearing aids. Only relatively few studies have addressed interaction effects and typically used real hearing aids with limited knowledge about the integrated algorithms. Here the potential benefit of different combinations and integration of SCNR and DRC was systematically assessed. DESIGN Ten different systems combining SCNR and DRC were implemented, including five serial arrangements, a parallel and two multiplicative approaches. In an instrumental evaluation, signal-to-noise ratio (SNR) improvement and spectral contrast enhancement (SCE) were assessed. Quality ratings at 0 and +6 dB SNR, and speech reception thresholds (SRTs) in noise were measured using stationary and babble noise. STUDY SAMPLE Thirteen young normal-hearing (NH) listeners and 12 hearing-impaired (HI) listeners participated. RESULTS In line with an increased segmental SNR and spectral contrast compared to a serial concatenation, the parallel approach significantly reduced the perceived noise annoyance for both subject groups. The proposed multiplicative approaches could partly counteract increased speech distortions introduced by DRC and achieved the best overall quality for the HI listeners. CONCLUSIONS For high SNRs well above the individual SRT, the specific combination of SCNR and DRC is perceptually relevant and the integrative approaches were preferred.
Affiliation(s)
- Steffen Kortlang
- Medizinische Physik and Cluster of Excellence Hearing4all, Universität Oldenburg, Oldenburg, Germany
- Zhangli Chen
- Medizinische Physik and Cluster of Excellence Hearing4all, Universität Oldenburg, Oldenburg, Germany
- Timo Gerkmann
- Speech Signal Processing and Cluster of Excellence Hearing4all, Universität Oldenburg, Oldenburg, Germany
- Birger Kollmeier
- Medizinische Physik and Cluster of Excellence Hearing4all, Universität Oldenburg, Oldenburg, Germany
- Volker Hohmann
- Medizinische Physik and Cluster of Excellence Hearing4all, Universität Oldenburg, Oldenburg, Germany
- Stephan D Ewert
- Medizinische Physik and Cluster of Excellence Hearing4all, Universität Oldenburg, Oldenburg, Germany
|
31
|
Sun P, Qin J. Speech enhancement via two-stage dual tree complex wavelet packet transform with a speech presence probability estimator. THE JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA 2017; 141:808. [PMID: 28253659 DOI: 10.1121/1.4976049] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/06/2023]
Abstract
In this paper, a two-stage dual tree complex wavelet packet transform (DTCWPT) based speech enhancement algorithm has been proposed, in which a speech presence probability (SPP) estimator and a generalized minimum mean squared error (MMSE) estimator are developed. To overcome the drawback of signal distortions caused by down sampling of wavelet packet transform (WPT), a two-stage analytic decomposition concatenating undecimated wavelet packet transform (UWPT) and decimated WPT is employed. An SPP estimator in the DTCWPT domain is derived based on a generalized Gamma distribution of speech, and Gaussian noise assumption. The validation results show that the proposed algorithm can obtain enhanced perceptual evaluation of speech quality (PESQ), and segmental signal-to-noise ratio (SegSNR) at low signal-to-noise ratio (SNR) nonstationary noise, compared with four other state-of-the-art speech enhancement algorithms, including optimally modified log-spectral amplitude (OM-LSA), soft masking using a posteriori SNR uncertainty (SMPO), a posteriori SPP based MMSE estimation (MMSE-SPP), and adaptive Bayesian wavelet thresholding (BWT).
Affiliation(s)
- Pengfei Sun
- Department of Electrical and Computer Engineering, Southern Illinois University Carbondale, Illinois 62901, USA
- Jun Qin
- Department of Electrical and Computer Engineering, Southern Illinois University Carbondale, Illinois 62901, USA
|
32
|
Huber R, Bisitz T, Gerkmann T, Kiessling J, Meister H, Kollmeier B. Comparison of single-microphone noise reduction schemes: can hearing impaired listeners tell the difference? Int J Audiol 2017; 57:S55-S61. [PMID: 28112001 DOI: 10.1080/14992027.2017.1279758] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/20/2022]
Abstract
OBJECTIVE The perceived qualities of nine different single-microphone noise reduction (SMNR) algorithms were evaluated and compared in subjective listening tests with normal hearing and hearing impaired (HI) listeners. DESIGN Speech samples mixed with traffic noise or party noise were processed by the SMNR algorithms. Subjects rated the amount of speech distortion, intrusiveness of background noise, listening effort and overall quality, using a simplified MUSHRA (ITU-R, 2003) assessment method. STUDY SAMPLE 18 normal hearing and 18 moderately HI subjects participated in the study. RESULTS Significant differences between the rating behaviours of the two subject groups were observed: while normal hearing subjects clearly differentiated between SMNR algorithms, HI subjects rated all processed signals very similarly. Moreover, in contrast to normal hearing subjects, HI subjects rated speech distortions of the unprocessed, noisier signals as more severe than those of the processed signals. CONCLUSIONS It seems harder for HI listeners to distinguish between additive noise and speech distortions, and/or they may understand the term "speech distortion" differently than normal hearing listeners do. The findings confirm that the evaluation of SMNR schemes for hearing aids should always involve HI listeners.
Affiliation(s)
- Rainer Huber
- HörTech gGmbH and Cluster of Excellence Hearing4All, Oldenburg, Germany
- Thomas Bisitz
- HörTech gGmbH and Cluster of Excellence Hearing4All, Oldenburg, Germany
- Timo Gerkmann
- Department of Medical Physics and Acoustics, University of Oldenburg, and Cluster of Excellence Hearing4All, Oldenburg, Germany
- Jürgen Kiessling
- Funktionsbereich Audiologie, Justus-Liebig University Giessen, Giessen, Germany
- Hartmut Meister
- Jean Uhrmacher Institute for Clinical ENT-Research, University of Cologne, Cologne, Germany
- Birger Kollmeier
- Department of Medical Physics and Acoustics, University of Oldenburg, and Cluster of Excellence Hearing4All, Oldenburg, Germany
|
33
|
Yao R, Zeng Z, Zhu P. A priori SNR estimation and noise estimation for speech enhancement. EURASIP JOURNAL ON ADVANCES IN SIGNAL PROCESSING 2016; 2016:101. [PMID: 27729928 PMCID: PMC5031741 DOI: 10.1186/s13634-016-0398-z] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 03/09/2016] [Accepted: 09/12/2016] [Indexed: 06/06/2023]
Abstract
A priori signal-to-noise ratio (SNR) estimation and noise estimation are important for speech enhancement. In this paper, a novel modified decision-directed (DD) a priori SNR estimation approach based on single-frequency entropy, named DDBSE, is proposed. DDBSE replaces the fixed weighting factor in the DD approach with an adaptive one calculated according to change of single-frequency entropy. Simultaneously, a new noise power estimation approach based on unbiased minimum mean square error (MMSE) and voice activity detection (VAD), named UMVAD, is proposed. UMVAD adopts different strategies to estimate noise in order to reduce over-estimation and under-estimation of noise. UMVAD improves the classical statistical model-based VAD by utilizing an adaptive threshold to replace the original fixed one and modifies the unbiased MMSE-based noise estimation approach using an adaptive a priori speech presence probability calculated by entropy instead of the original fixed one. Experimental results show that DDBSE can provide greater noise suppression than DD and UMVAD can improve the accuracy of noise estimation. Compared to existing approaches, speech enhancement based on UMVAD and DDBSE can obtain a better segment SNR score and composite measure covl score, especially in adverse environments such as non-stationary noise and low-SNR.
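For reference, the classic decision-directed (DD) estimator that DDBSE modifies combines the previous frame's enhanced estimate with the current a posteriori SNR through a fixed weighting factor alpha; DDBSE's contribution is making that factor adaptive via single-frequency entropy. A minimal sketch of the standard DD rule (function and parameter names are ours):

```python
import numpy as np

def decision_directed_snr(prev_clean_amp, noise_pow, a_post_snr, alpha=0.98):
    """A priori SNR estimate per frequency bin:
    xi_hat = alpha * A_prev^2 / sigma_n^2 + (1 - alpha) * max(gamma - 1, 0),
    where A_prev is the previous frame's enhanced amplitude, sigma_n^2 the
    noise power estimate, and gamma the a posteriori SNR."""
    return (alpha * prev_clean_amp ** 2 / noise_pow
            + (1.0 - alpha) * np.maximum(a_post_snr - 1.0, 0.0))
```

With the fixed alpha near 1, the estimate leans heavily on the previous frame (smooth but sluggish in transients); an adaptive alpha, as in DDBSE, trades smoothing against responsiveness frame by frame.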
Affiliation(s)
- Rui Yao
- College of Automation Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing, China
- ZeQing Zeng
- College of Automation Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing, China
- Ping Zhu
- College of Automation Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing, China
|
34
|
A New Biologically Inspired Fuzzy Expert System-Based Voiced/Unvoiced Decision Algorithm for Speech Enhancement. Cognit Comput 2016. [DOI: 10.1007/s12559-015-9376-2] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/26/2022]
|
35
|
Speech enhancement based on wavelet packet of an improved principal component analysis. COMPUT SPEECH LANG 2016. [DOI: 10.1016/j.csl.2015.06.001] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/24/2022]
|
36
|
Baumgärtel RM, Hu H, Krawczyk-Becker M, Marquardt D, Herzke T, Coleman G, Adiloğlu K, Bomke K, Plotz K, Gerkmann T, Doclo S, Kollmeier B, Hohmann V, Dietz M. Comparing Binaural Pre-processing Strategies II: Speech Intelligibility of Bilateral Cochlear Implant Users. Trends Hear 2015; 19:19/0/2331216515617917. [PMID: 26721921 PMCID: PMC4771034 DOI: 10.1177/2331216515617917] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/16/2022] Open
Abstract
Several binaural audio signal enhancement algorithms were evaluated with respect to their potential to improve speech intelligibility in noise for users of bilateral cochlear implants (CIs). 50% speech reception thresholds (SRT50) were assessed using an adaptive procedure in three distinct, realistic noise scenarios. All scenarios were highly nonstationary, complex, and included a significant amount of reverberation. Other aspects, such as the perfectly frontal target position, were idealized laboratory settings, allowing the algorithms to perform better than in corresponding real-world conditions. Eight bilaterally implanted CI users, wearing devices from three manufacturers, participated in the study. In all noise conditions, a substantial improvement in SRT50 compared to the unprocessed signal was observed for most of the algorithms tested, with the largest improvements generally provided by binaural minimum variance distortionless response (MVDR) beamforming algorithms. The largest overall improvement in speech intelligibility was achieved by an adaptive binaural MVDR in a spatially separated, single competing talker noise scenario. A no-pre-processing condition and adaptive differential microphones without a binaural link served as the two baseline conditions. SRT50 improvements provided by the binaural MVDR beamformers surpassed the performance of the adaptive differential microphones in most cases. Speech intelligibility improvements predicted by instrumental measures were shown to account for some but not all aspects of the perceptually obtained SRT50 improvements measured in bilaterally implanted CI users.
Affiliation(s)
- Regina M Baumgärtel
- Medizinische Physik, Carl von Ossietzky Universität, Oldenburg, Germany; Cluster of Excellence 'Hearing4all', Oldenburg, Germany
- Hongmei Hu
- Medizinische Physik, Carl von Ossietzky Universität, Oldenburg, Germany; Cluster of Excellence 'Hearing4all', Oldenburg, Germany
- Martin Krawczyk-Becker
- Cluster of Excellence 'Hearing4all', Oldenburg, Germany; Speech Signal Processing Group, Carl von Ossietzky Universität Oldenburg, Germany
- Daniel Marquardt
- Cluster of Excellence 'Hearing4all', Oldenburg, Germany; Signal Processing Group, Carl von Ossietzky Universität Oldenburg, Germany
- Tobias Herzke
- Cluster of Excellence 'Hearing4all', Oldenburg, Germany; HörTech gGmbH, Oldenburg, Germany
- Graham Coleman
- Cluster of Excellence 'Hearing4all', Oldenburg, Germany; HörTech gGmbH, Oldenburg, Germany
- Kamil Adiloğlu
- Cluster of Excellence 'Hearing4all', Oldenburg, Germany; HörTech gGmbH, Oldenburg, Germany
- Katrin Bomke
- Cochlear-Implant Centrum CIC, Klinik für Phoniatrie und Pädaudiologie, Ev. Krankenhaus Oldenburg, Germany
- Karsten Plotz
- Cochlear-Implant Centrum CIC, Klinik für Phoniatrie und Pädaudiologie, Ev. Krankenhaus Oldenburg, Germany
- Timo Gerkmann
- Cluster of Excellence 'Hearing4all', Oldenburg, Germany; Speech Signal Processing Group, Carl von Ossietzky Universität Oldenburg, Germany
- Simon Doclo
- Cluster of Excellence 'Hearing4all', Oldenburg, Germany; Signal Processing Group, Carl von Ossietzky Universität Oldenburg, Germany
- Birger Kollmeier
- Medizinische Physik, Carl von Ossietzky Universität, Oldenburg, Germany; Cluster of Excellence 'Hearing4all', Oldenburg, Germany
- Volker Hohmann
- Medizinische Physik, Carl von Ossietzky Universität, Oldenburg, Germany; Cluster of Excellence 'Hearing4all', Oldenburg, Germany; HörTech gGmbH, Oldenburg, Germany
- Mathias Dietz
- Medizinische Physik, Carl von Ossietzky Universität, Oldenburg, Germany; Cluster of Excellence 'Hearing4all', Oldenburg, Germany
|
37
|
Baumgärtel RM, Krawczyk-Becker M, Marquardt D, Völker C, Hu H, Herzke T, Coleman G, Adiloğlu K, Ernst SMA, Gerkmann T, Doclo S, Kollmeier B, Hohmann V, Dietz M. Comparing Binaural Pre-processing Strategies I: Instrumental Evaluation. Trends Hear 2015; 19:19/0/2331216515617916. [PMID: 26721920 PMCID: PMC4771044 DOI: 10.1177/2331216515617916] [Citation(s) in RCA: 28] [Impact Index Per Article: 3.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/21/2022] Open
Abstract
In a collaborative research project, several monaural and binaural noise reduction algorithms have been comprehensively evaluated. In this article, eight selected noise reduction algorithms were assessed using instrumental measures, with a focus on the instrumental evaluation of speech intelligibility. Four distinct, reverberant scenarios were created to reflect everyday listening situations: a stationary speech-shaped noise, a multitalker babble noise, a single interfering talker, and a realistic cafeteria noise. Three instrumental measures were employed to assess predicted speech intelligibility and predicted sound quality: the intelligibility-weighted signal-to-noise ratio, the short-time objective intelligibility measure, and the perceptual evaluation of speech quality. The results show substantial improvements in predicted speech intelligibility as well as sound quality for the proposed algorithms. The evaluated coherence-based noise reduction algorithm was able to provide improvements in predicted audio signal quality. For the tested single-channel noise reduction algorithm, improvements in intelligibility-weighted signal-to-noise ratio were observed in all but the nonstationary cafeteria ambient noise scenario. Binaural minimum variance distortionless response beamforming algorithms performed particularly well in all noise scenarios.
Affiliation(s)
- Regina M Baumgärtel
- Medical Physics Group, Carl von Ossietzky Universität Oldenburg, Oldenburg, Germany; Cluster of Excellence "Hearing4all", Oldenburg, Germany
- Martin Krawczyk-Becker
- Cluster of Excellence "Hearing4all", Oldenburg, Germany; Speech Signal Processing Group, Carl von Ossietzky Universität Oldenburg, Oldenburg, Germany
- Daniel Marquardt
- Cluster of Excellence "Hearing4all", Oldenburg, Germany; Signal Processing Group, Carl von Ossietzky Universität Oldenburg, Oldenburg, Germany
- Christoph Völker
- Medical Physics Group, Carl von Ossietzky Universität Oldenburg, Oldenburg, Germany; Cluster of Excellence "Hearing4all", Oldenburg, Germany
- Hongmei Hu
- Medical Physics Group, Carl von Ossietzky Universität Oldenburg, Oldenburg, Germany; Cluster of Excellence "Hearing4all", Oldenburg, Germany
- Tobias Herzke
- Cluster of Excellence "Hearing4all", Oldenburg, Germany; HörTech gGmbH, Oldenburg, Germany
- Graham Coleman
- Cluster of Excellence "Hearing4all", Oldenburg, Germany; HörTech gGmbH, Oldenburg, Germany
- Kamil Adiloğlu
- Cluster of Excellence "Hearing4all", Oldenburg, Germany; HörTech gGmbH, Oldenburg, Germany
- Stephan M A Ernst
- Medical Physics Group, Carl von Ossietzky Universität Oldenburg, Oldenburg, Germany; Cluster of Excellence "Hearing4all", Oldenburg, Germany
- Timo Gerkmann
- Cluster of Excellence "Hearing4all", Oldenburg, Germany; Speech Signal Processing Group, Carl von Ossietzky Universität Oldenburg, Oldenburg, Germany
- Simon Doclo
- Cluster of Excellence "Hearing4all", Oldenburg, Germany; Signal Processing Group, Carl von Ossietzky Universität Oldenburg, Oldenburg, Germany
- Birger Kollmeier
- Medical Physics Group, Carl von Ossietzky Universität Oldenburg, Oldenburg, Germany; Cluster of Excellence "Hearing4all", Oldenburg, Germany
- Volker Hohmann
- Medical Physics Group, Carl von Ossietzky Universität Oldenburg, Oldenburg, Germany; Cluster of Excellence "Hearing4all", Oldenburg, Germany; HörTech gGmbH, Oldenburg, Germany
- Mathias Dietz
- Medical Physics Group, Carl von Ossietzky Universität Oldenburg, Oldenburg, Germany; Cluster of Excellence "Hearing4all", Oldenburg, Germany
|
38
|
Lauwereins S, Badami K, Meert W, Verhelst M. Optimal resource usage in ultra-low-power sensor interfaces through context- and resource-cost-aware machine learning. Neurocomputing 2015. [DOI: 10.1016/j.neucom.2014.11.077] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/23/2022]
|
39
|
Marchegiani L, Fafoutis X. On cross-language consonant identification in second language noise. THE JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA 2015; 138:2206-2209. [PMID: 26520302 DOI: 10.1121/1.4930955] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/05/2023]
Abstract
Speech perception in everyday conditions is highly affected by the presence of noise of various kinds. The presence of overlapping speakers is considered an especially challenging scenario, as it introduces both energetic and informational masking. The efficacy of the masking also depends on familiarity with the language of both the target and masking stimuli. This work analyses consonant identification by non-native English speakers in N-talker natural babble noise and babble-modulated noise, varying the number of talkers in the babble. In particular, only English consonants that are also present in all the native languages of the subjects are used. As the subjects are familiar with the consonants used, this study can be considered a step towards a deeper analysis of the perception of first-language speech in the presence of second-language maskers.
Affiliation(s)
- Letizia Marchegiani
- Language and Speech Laboratory, Faculty of Art, University of the Basque Country, 01006 Vitoria-Gasteiz, Spain
- Xenofon Fafoutis
- Department of Applied Mathematics and Computer Science, Technical University of Denmark, 2800 Kongens Lyngby, Denmark
|
40
|
Sang J, Hu H, Zheng C, Li G, Lutman ME, Bleeck S. Speech quality evaluation of a sparse coding shrinkage noise reduction algorithm with normal hearing and hearing impaired listeners. Hear Res 2015; 327:175-85. [DOI: 10.1016/j.heares.2015.07.019] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 12/24/2014] [Revised: 07/20/2015] [Accepted: 07/23/2015] [Indexed: 12/01/2022]
|
41
|
Garg A, Sahu O. Cuckoo search based optimal mask generation for noise suppression and enhancement of speech signal. JOURNAL OF KING SAUD UNIVERSITY - COMPUTER AND INFORMATION SCIENCES 2015. [DOI: 10.1016/j.jksuci.2014.04.006] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 10/23/2022]
|
42
|
Leitner C, Pernkopf F. On pre-image iterations for speech enhancement. Springerplus 2015; 4:243. [PMID: 26085973 PMCID: PMC4464577 DOI: 10.1186/s40064-015-0983-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 12/02/2014] [Accepted: 04/17/2015] [Indexed: 11/10/2022]
Abstract
In this paper, we apply kernel PCA to speech enhancement and derive pre-image iterations for speech enhancement. Both methods make use of a Gaussian kernel. The kernel variance serves as a tuning parameter that has to be adapted to the SNR and the desired degree of de-noising. We develop a method to derive a suitable kernel variance from a noise estimate, adapting the pre-image iterations to arbitrary SNRs. In experiments, we compare the performance of kernel PCA and pre-image iterations in terms of objective speech quality measures and automatic speech recognition. The speech data are corrupted by white and colored noise at 0, 5, 10, and 15 dB SNR. As benchmarks, we provide results for the generalized subspace method, spectral subtraction, and the minimum mean-square error log-spectral amplitude estimator. In terms of the scores of the PEASS (Perceptual Evaluation Methods for Audio Source Separation) toolbox, the proposed methods achieve performance similar to the reference methods. The speech recognition experiments show that utterances processed by pre-image iterations achieve consistently better word recognition accuracy than both the unprocessed noisy utterances and the utterances processed by the generalized subspace method.
Affiliation(s)
- Christina Leitner
- JOANNEUM RESEARCH Forschungsgesellschaft mbH, DIGITAL - Institute for Information and Communication Technologies, Steyrergasse 17, Graz, 8010 Austria
- Franz Pernkopf
- Graz University of Technology, Institute of Signal Processing and Speech Communication, Inffeldgasse 16c, Graz, 8010 Austria
|
43
|
Yook S, Nam KW, Kim H, Hong SH, Jang DP, Kim IY. An Environment-Adaptive Management Algorithm for Hearing-Support Devices Incorporating Listening Situation and Noise Type Classifiers. Artif Organs 2014; 39:361-8. [DOI: 10.1111/aor.12391] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/29/2022]
Affiliation(s)
- Sunhyun Yook
- Department of Biomedical Engineering, Hanyang University, Seoul, Korea
- Kyoung Won Nam
- Department of Biomedical Engineering, Hanyang University, Seoul, Korea
- Heepyung Kim
- Department of Biomedical Engineering, Hanyang University, Seoul, Korea
- Sung Hwa Hong
- Department of Otolaryngology-Head and Neck Surgery, Samsung Medical Center, Seoul, Korea
- Dong Pyo Jang
- Department of Biomedical Engineering, Hanyang University, Seoul, Korea
- In Young Kim
- Department of Biomedical Engineering, Hanyang University, Seoul, Korea
|
44
|
Koning R, Madhu N, Wouters J. Ideal time-frequency masking algorithms lead to different speech intelligibility and quality in normal-hearing and cochlear implant listeners. IEEE Trans Biomed Eng 2014; 62:331-41. [PMID: 25167542 DOI: 10.1109/tbme.2014.2351854] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/06/2022]
Abstract
Hearing-impaired listeners using cochlear implants (CIs) suffer a decrease in speech intelligibility (SI) in adverse listening conditions. Time-frequency masks are often applied to perform noise suppression in an attempt to increase SI. Two important masks are the so-called ideal binary mask (IBM), with its binary weights, and the ideal Wiener filter (IWF), with its continuous weights. It is unclear which of the masks has the higher potential for SI and speech-quality enhancement in CI users. In this study, both approaches to SI and quality enhancement were compared. The investigations were conducted in normal-hearing (NH) subjects listening to noise-vocoder CI simulations and in CI users. The potential for SI improvement was assessed in a sentence recognition task with ideal mask estimates in multitalker babble and with an interfering talker. The robustness of the approaches was evaluated with simulated estimation errors. CI users assessed the speech quality in a preference rating. The IWF outperformed the IBM in NH listeners. In contrast, no significant difference was obtained in CI users. Estimation errors degraded SI in CI users for both approaches. In terms of quality, the IWF slightly outperformed the IBM-processed signals. The outcomes of this study suggest that the mask pattern is not that crucial for CIs. Results of speech enhancement algorithms obtained with NH subjects listening to vocoded or normally processed stimuli do not translate to CI users. This outcome means that the effect of new strategies has to be quantified with the user group considered.
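The two ideal masks compared in this study have standard textbook definitions over the power spectrograms of the separately known target and noise ("ideal" because both are known). A minimal sketch, assuming a 0 dB local criterion for the IBM:

```python
import numpy as np

def ideal_binary_mask(target_pow, noise_pow, lc_db=0.0):
    """IBM: keep (1) any T-F unit whose local SNR exceeds the criterion."""
    snr_db = 10.0 * np.log10(target_pow / np.maximum(noise_pow, 1e-12))
    return (snr_db > lc_db).astype(float)

def ideal_wiener_filter(target_pow, noise_pow):
    """IWF: continuous gain in [0, 1] from target and noise power."""
    return target_pow / np.maximum(target_pow + noise_pow, 1e-12)
```

Either mask is applied by multiplying it element-wise with the noisy-mixture STFT before resynthesis; the IBM makes hard keep/discard decisions per unit, whereas the IWF attenuates gradually.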
|
45
|
Yu C, Wójcicki KK, Loizou PC, Hansen JHL, Johnson MT. Evaluation of the importance of time-frequency contributions to speech intelligibility in noise. J Acoust Soc Am 2014; 135:3007-16. [PMID: 24815280 PMCID: PMC4032418 DOI: 10.1121/1.4869088] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 02/13/2013] [Revised: 12/27/2013] [Accepted: 03/07/2014] [Indexed: 05/24/2023]
Abstract
Recent studies on binary masking techniques make the assumption that each time-frequency (T-F) unit contributes an equal amount to the overall intelligibility of speech. The present study demonstrated that the importance of each T-F unit to speech intelligibility varies in accordance with speech content. Specifically, T-F units are categorized into two classes, speech-present T-F units and speech-absent T-F units. Results indicate that the importance of each speech-present T-F unit to speech intelligibility is highly related to the loudness of its target component, while the importance of each speech-absent T-F unit varies according to the loudness of its masker component. Two types of mask errors are also considered, which include miss and false alarm errors. Consistent with previous work, false alarm errors are shown to be more harmful to speech intelligibility than miss errors when the mixture signal-to-noise ratio (SNR) is below 0 dB. However, the relative importance between the two types of error is conditioned on the SNR level of the input speech signal. Based on these observations, a mask-based objective measure, the loudness weighted hit-false, is proposed for predicting speech intelligibility. The proposed objective measure shows significantly higher correlation with intelligibility compared to two existing mask-based objective measures.
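The interplay of miss and false-alarm errors discussed in this abstract is commonly summarised by the HIT-FA metric, a precursor of the measure proposed here. The sketch below is the plain unweighted version; the paper's loudness-weighted hit-false measure additionally weights each T-F unit by the loudness of its target or masker component, which is omitted here:

```python
import numpy as np

def hit_fa(ideal_mask, estimated_mask):
    """Hit rate, false-alarm rate, and their difference for binary masks.

    A miss is a speech-present unit wrongly removed (lowering the hit
    rate); a false alarm is a speech-absent unit wrongly retained.
    """
    ideal = np.asarray(ideal_mask, dtype=bool)
    est = np.asarray(estimated_mask, dtype=bool)
    hit = float(np.mean(est[ideal])) if ideal.any() else 0.0
    fa = float(np.mean(est[~ideal])) if (~ideal).any() else 0.0
    return hit, fa, hit - fa
```

The HIT-FA difference penalises false alarms as heavily as misses; the abstract's finding that false alarms are more harmful below 0 dB SNR is one motivation for weighting the two error types differently.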
Affiliation(s)
- Chengzhu Yu
- Department of Electrical Engineering, Erik Jonsson School of Engineering and Computer Science, University of Texas at Dallas, Richardson, Texas 75083
- Kamil K Wójcicki
- Department of Electrical Engineering, Erik Jonsson School of Engineering and Computer Science, University of Texas at Dallas, Richardson, Texas 75083
- Philipos C Loizou
- Department of Electrical Engineering, Erik Jonsson School of Engineering and Computer Science, University of Texas at Dallas, Richardson, Texas 75083
- John H L Hansen
- Department of Electrical Engineering, Erik Jonsson School of Engineering and Computer Science, University of Texas at Dallas, Richardson, Texas 75083
- Michael T Johnson
- Speech and Signal Processing Laboratory, Marquette University, 1515 West Wisconsin Avenue, Milwaukee, Wisconsin 53201-1881
|
46
|
Liu C, Azimi B, Bhandary M, Hu Y. Contribution of low-frequency harmonics to Mandarin Chinese tone identification in quiet and six-talker babble background. J Acoust Soc Am 2014; 135:428-438. [PMID: 24437783 DOI: 10.1121/1.4837255] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/03/2023]
Abstract
The goal of this study was to investigate Mandarin Chinese tone identification in quiet and multi-talker babble conditions for normal-hearing listeners. Tone identification was measured with speech stimuli and stimuli with low and/or high harmonics that were embedded in three Mandarin vowels with two fundamental frequencies. There were six types of stimuli: all harmonics (All), low harmonics (Low), high harmonics (High), and the first (H1), second (H2), and third (H3) harmonic. Results showed that, for quiet conditions, individual harmonics carried frequency contour information well enough for tone identification with high accuracy; however, in noisy conditions, tone identification with individual low harmonics (e.g., H1, H2, and H3) was significantly lower than that with the Low, High, and All harmonics. Moreover, tone identification with individual harmonics in noise was lower for a low F0 than for a high F0, and was also dependent on vowel category. Tone identification with individual low-frequency harmonics was accounted for by local signal-to-noise ratios, indicating that audibility of harmonics in noise may play a primary role in tone identification.
Affiliation(s)
- Chang Liu
- Department of Communication Sciences and Disorders, University of Texas at Austin, 1 University Station A1100, Austin, Texas 78712
- Behnam Azimi
- Department of Electrical Engineering and Computer Science, University of Wisconsin-Milwaukee, Milwaukee, Wisconsin 53201
- Moulesh Bhandary
- Department of Electrical Engineering and Computer Science, University of Wisconsin-Milwaukee, Milwaukee, Wisconsin 53201
- Yi Hu
- Department of Electrical Engineering and Computer Science, University of Wisconsin-Milwaukee, Milwaukee, Wisconsin 53201
|
47
|
Hu Y, Tahmina Q, Runge C, Friedland DR. The perception of telephone-processed speech by combined electric and acoustic stimulation. Trends Amplif 2013; 17:189-96. [PMID: 24265213 DOI: 10.1177/1084713813512901] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/16/2022]
Abstract
This study assesses the effects of adding low- or high-frequency information to the band-limited telephone-processed speech on bimodal listeners' telephone speech perception in quiet environments. In the proposed experiments, bimodal users were presented under quiet listening conditions with wideband speech (WB), bandpass-filtered telephone speech (300-3,400 Hz, BP), high-pass filtered speech (f > 300 Hz, HP, i.e., distorted frequency components above 3,400 Hz in telephone speech were restored), and low-pass filtered speech (f < 3,400 Hz, LP, i.e., distorted frequency components below 300 Hz in telephone speech were restored). Results indicated that in quiet environments, for all four types of stimuli, listening with both hearing aid (HA) and cochlear implant (CI) was significantly better than listening with CI alone. For both bimodal and CI-alone modes, there were no statistically significant differences between the LP and BP scores and between the WB and HP scores. However, the HP scores were significantly better than the BP scores. In quiet conditions, both CI alone and bimodal listening achieved the largest benefits when telephone speech was augmented with high rather than low-frequency information. These findings provide support for the design of algorithms that would extend higher frequency information, at least in quiet environments.
Affiliation(s)
- Yi Hu
- Department of Electrical Engineering & Computer Science, University of Wisconsin-Milwaukee, WI, USA
|
48
|
Marković I, Jurić-Kavelj S, Petrović I. Partial mutual information based input variable selection for supervised learning approaches to voice activity detection. Appl Soft Comput 2013. [DOI: 10.1016/j.asoc.2013.06.013] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/26/2022]
|
49
|
Ye H, Deng G, Mauger SJ, Hersbach AA, Dawson PW, Heasman JM. A wavelet-based noise reduction algorithm and its clinical evaluation in cochlear implants. PLoS One 2013; 8:e75662. [PMID: 24086605 PMCID: PMC3784455 DOI: 10.1371/journal.pone.0075662] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/05/2013] [Accepted: 08/16/2013] [Indexed: 11/18/2022] Open
Abstract
Noise reduction is often essential for cochlear implant (CI) recipients to achieve acceptable speech perception in noisy environments. Most noise reduction algorithms applied to audio signals are based on time-frequency representations of the input, such as the Fourier transform. Algorithms based on other representations may also be able to provide comparable or better speech perception and listening quality. In this paper, a noise reduction algorithm for CI sound processing is proposed based on the wavelet transform. The algorithm uses a dual-tree complex discrete wavelet transform followed by shrinkage of the wavelet coefficients based on a statistical estimate of the noise variance. The proposed noise reduction algorithm was evaluated by comparing its performance with that of several existing wavelet-based algorithms. The speech transmission index (STI) of the proposed algorithm is significantly better than those of the other tested algorithms for speech-weighted noise at various signal-to-noise ratios. The effectiveness of the proposed system was clinically evaluated with CI recipients, showing a significant average improvement in speech perception of 1.9 dB in speech-weighted noise.
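The coefficient-shrinkage step at the heart of such algorithms can be illustrated with a one-level Haar transform and soft thresholding. This is a deliberately simplified sketch: the paper itself uses a dual-tree complex DWT and derives the threshold from a statistical noise-variance estimate, both of which are omitted here:

```python
import numpy as np

def haar_denoise(x, threshold):
    """One-level Haar wavelet soft-threshold denoising (simplified sketch).

    Detail (high-pass) coefficients, which carry most of the broadband
    noise, are shrunk toward zero; approximation coefficients are kept.
    """
    x = np.asarray(x, dtype=float)
    if len(x) % 2:                               # pad to even length
        x = np.append(x, x[-1])
    approx = (x[0::2] + x[1::2]) / np.sqrt(2)    # low-pass coefficients
    detail = (x[0::2] - x[1::2]) / np.sqrt(2)    # high-pass coefficients
    # Soft thresholding: shrink magnitudes by `threshold`, clip at zero.
    detail = np.sign(detail) * np.maximum(np.abs(detail) - threshold, 0.0)
    # Inverse Haar transform.
    out = np.empty(2 * len(approx))
    out[0::2] = (approx + detail) / np.sqrt(2)
    out[1::2] = (approx - detail) / np.sqrt(2)
    return out
```

With a zero threshold the transform reconstructs the input exactly; as the threshold grows, the output is smoothed toward the local pairwise means, which is the basic trade-off every shrinkage rule tunes.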
Affiliation(s)
- Hua Ye
- Department of Electronic Engineering, La Trobe University, Melbourne, Victoria, Australia
- Guang Deng
- Department of Electronic Engineering, La Trobe University, Melbourne, Victoria, Australia
- Pam W. Dawson
- Cochlear Limited, Melbourne, Victoria, Australia
- HEARing CRC, Melbourne, Victoria, Australia
|
50
|
Mirzahasanloo TS, Kehtarnavaz N, Gopalakrishna V, Loizou PC. Environment-adaptive speech enhancement for bilateral cochlear implants using a single processor. Speech Commun 2013; 55:523-534. [PMID: 24610967 PMCID: PMC3945750 DOI: 10.1016/j.specom.2012.10.004] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/03/2023]
Abstract
A computationally efficient speech enhancement pipeline for noisy environments, based on a single-processor implementation, is developed for use in bilateral cochlear implant systems. A two-channel joint objective function is defined and a closed-form solution is obtained based on the weighted-Euclidean distortion measure. Its computational efficiency, and the fact that no synchronization is required, make the pipeline a suitable candidate for real-time deployment. A speech quality measure is used to show its effectiveness in six different noisy environments, compared with a similar one-channel enhancement pipeline using two separate processors or independent sequential processing.
Affiliation(s)
- Nasser Kehtarnavaz
- Corresponding author. Tel.: +1 972 883 6838; fax: +1 972 883 2710.
|