1. Luo X, Ke Y, Li X, Zheng C. On phase recovery and preserving early reflections for deep-learning speech dereverberation. J Acoust Soc Am 2024; 155:436-451. [PMID: 38240664] [DOI: 10.1121/10.0024348]
Abstract
In indoor environments, reverberation often distorts clean speech. Although deep-learning-based speech dereverberation approaches have shown much better performance than traditional ones, the inferior quality of the dereverberated speech caused by magnitude distortion and limited phase recovery remains a serious problem for practical applications. This paper improves the performance of deep-learning-based speech dereverberation from the perspectives of both network design and mapping-target optimization. Specifically, on the one hand, a bifurcated-and-fusion network and its guidance loss functions were designed to help reduce magnitude distortion while enhancing phase recovery. On the other hand, the time boundary between the early and late reflections in the mapped speech was investigated, so as to strike a balance between the reverberation tailing effect and the difficulty of magnitude/phase recovery. Mathematical derivations were provided to show the rationality of the specially designed loss functions. Geometric illustrations were given to explain the importance of preserving early reflections in reducing the difficulty of phase recovery. Ablation study results confirmed the validity of the proposed network topology and the importance of preserving 20 ms of early reflections in the mapped speech. Objective and subjective test results showed that the proposed system outperformed other baselines in the speech dereverberation task.
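To make the mapping-target idea concrete, the sketch below builds a training pair in which the target keeps the direct sound plus 20 ms of early reflections, matching the boundary the ablation favored. It is a minimal illustration only: the peak-based onset detection, function names, and use of scipy are assumptions, not the authors' implementation.

```python
# Minimal sketch: build (reverberant input, early-reflection target) pairs
# for dereverberation training, assuming a 20 ms early/late boundary.
import numpy as np
from scipy.signal import fftconvolve

def make_training_pair(clean, rir, fs, boundary_ms=20.0):
    onset = int(np.argmax(np.abs(rir)))          # direct-path peak (assumed onset)
    cut = onset + int(boundary_ms * 1e-3 * fs)   # early/late split point
    early_rir = rir[:cut]                        # direct sound + early reflections
    reverberant = fftconvolve(clean, rir)[:len(clean)]
    target = fftconvolve(clean, early_rir)[:len(clean)]
    return reverberant, target
```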
Affiliation(s)
- Xiaoxue Luo, Yuxuan Ke, Xiaodong Li, Chengshi Zheng: Key Laboratory of Noise and Vibration Research, Institute of Acoustics, Chinese Academy of Sciences, 100190, Beijing, China

2. Zheng C, Zhang H, Liu W, Luo X, Li A, Li X, Moore BCJ. Sixty Years of Frequency-Domain Monaural Speech Enhancement: From Traditional to Deep Learning Methods. Trends Hear 2023; 27:23312165231209913. [PMID: 37956661] [PMCID: PMC10658184] [DOI: 10.1177/23312165231209913] (Open Access)
Abstract
Frequency-domain monaural speech enhancement has been studied extensively for over 60 years, and a great number of methods have been proposed and applied in many devices. In the last decade, monaural speech enhancement has made tremendous progress with the advent and development of deep learning, and the performance of such methods has improved greatly relative to traditional methods. This survey first provides a comprehensive overview of traditional and deep-learning methods for monaural speech enhancement in the frequency domain. The fundamental assumptions of each approach are then summarized and analyzed to clarify their limitations and advantages. A comprehensive evaluation of some typical methods was conducted on the WSJ + Deep Noise Suppression (DNS) challenge and Voice Bank + DEMAND datasets to give an intuitive and unified comparison. The benefits of monaural speech enhancement methods were evaluated using objective metrics relevant to normal-hearing and hearing-impaired listeners. The objective test results showed that compression of the input features was important for simulated normal-hearing listeners but not for simulated hearing-impaired listeners. Potential future research and development topics in monaural speech enhancement are suggested.
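As a concrete illustration of the input-feature compression that the objective tests highlight, the sketch below raises STFT magnitudes to a power below one, which shrinks the dynamic range seen by a network; the 0.3 exponent and the use of librosa are illustrative assumptions rather than the survey's prescription.

```python
# Minimal sketch: power-compressed spectral features for a speech enhancer.
import numpy as np
import librosa

def compressed_features(wave, n_fft=512, hop=256, power=0.3):
    spec = librosa.stft(wave, n_fft=n_fft, hop_length=hop)
    mag, phase = np.abs(spec), np.angle(spec)
    return mag ** power, phase    # compressed magnitude; phase kept for resynthesis
```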
Affiliation(s)
- Chengshi Zheng, Huiyong Zhang, Wenzhe Liu, Xiaoxue Luo, Andong Li, Xiaodong Li: Key Laboratory of Noise and Vibration Research, Institute of Acoustics, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China
- Brian C. J. Moore: Cambridge Hearing Group, Department of Psychology, University of Cambridge, Cambridge, UK

3. Pandey A, Wang D. Self-attending RNN for Speech Enhancement to Improve Cross-corpus Generalization. IEEE/ACM Trans Audio Speech Lang Process 2022; 30:1374-1385. [PMID: 36245814] [PMCID: PMC9560045] [DOI: 10.1109/taslp.2022.3161143]
Abstract
Deep neural networks (DNNs) represent the mainstream methodology for supervised speech enhancement, primarily because of their capability to model complex functions using hierarchical representations. However, a recent study revealed that DNNs trained on a single corpus fail to generalize to untrained corpora, especially in low signal-to-noise ratio (SNR) conditions. Developing a noise-, speaker-, and corpus-independent speech enhancement algorithm is essential for real-world applications. In this study, we propose a self-attending recurrent neural network (SARNN) for time-domain speech enhancement to improve cross-corpus generalization. SARNN comprises recurrent neural networks (RNNs) augmented with self-attention blocks and feedforward blocks. We evaluate SARNN on different corpora with nonstationary noises in low-SNR conditions. Experimental results demonstrate that SARNN substantially outperforms competitive approaches to time-domain speech enhancement, such as RNNs and dual-path SARNNs. Additionally, we report an important finding: two popular approaches to speech enhancement, complex spectral mapping and time-domain enhancement, obtain similar results for RNN and SARNN with large-scale training. We also provide a challenging subset of the test set used in this study for evaluating future algorithms and facilitating direct comparisons.
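The block structure described above (RNNs augmented with self-attention and feedforward blocks) could look roughly like the following PyTorch sketch; the layer sizes, residual wiring, and module names are assumptions, not the paper's exact architecture.

```python
# Minimal sketch of one self-attending RNN block: an LSTM layer followed by
# self-attention over frames and a feedforward block, with residual adds.
import torch.nn as nn

class SelfAttendingRNNBlock(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.rnn = nn.LSTM(dim, dim, batch_first=True)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):             # x: (batch, frames, dim)
        h, _ = self.rnn(x)
        a, _ = self.attn(h, h, h)     # self-attention across time frames
        h = h + a                     # residual connection (assumed)
        return h + self.ff(h)
```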
Affiliation(s)
- Ashutosh Pandey: Department of Computer Science and Engineering, The Ohio State University, Columbus, OH 43210, USA
- DeLiang Wang: Department of Computer Science and Engineering and the Center for Cognitive and Brain Sciences, The Ohio State University, Columbus, OH 43210, USA

4. A context aware-based deep neural network approach for simultaneous speech denoising and dereverberation. Neural Comput Appl 2022. [DOI: 10.1007/s00521-022-06968-1]

5. Healy EW, Taherian H, Johnson EM, Wang D. A causal and talker-independent speaker separation/dereverberation deep learning algorithm: Cost associated with conversion to real-time capable operation. J Acoust Soc Am 2021; 150:3976. [PMID: 34852625] [PMCID: PMC8612765] [DOI: 10.1121/10.0007134]
Abstract
The fundamental requirement for real-time operation of a speech-processing algorithm is causality: the algorithm must operate without utilizing future time frames. In the present study, the performance of a fully causal deep computational auditory scene analysis algorithm was assessed. Target sentences were isolated from complex interference consisting of an interfering talker and concurrent room reverberation. The talker- and corpus/channel-independent model used Dense-UNet and temporal convolutional networks and estimated both the magnitude and phase of the target speech. Mean algorithm benefit was found to be significant in every condition. Mean benefit for hearing-impaired (HI) listeners across all conditions was 46.4 percentage points. The cost of converting the algorithm to causal processing was also assessed by comparison with a prior non-causal version. Intelligibility decrements for HI and normal-hearing listeners from non-causal to causal processing were present in most but not all conditions, and were statistically significant in half of the conditions tested, namely those representing the greater levels of complex interference. Although a cost associated with causal processing was present in most conditions, it may be considered modest relative to the overall level of benefit.
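The causality constraint has a simple expression in code: every operation may look back in time but never forward. The sketch below shows a causal temporal convolution of the kind used in temporal convolutional networks; the kernel size and channel counts are illustrative, and this is not the study's actual model.

```python
# Minimal sketch: a causal 1-D convolution pads only on the left, so the
# output at frame t depends on frames <= t (no look-ahead).
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    def __init__(self, ch_in, ch_out, kernel=3, dilation=1):
        super().__init__()
        self.left_pad = (kernel - 1) * dilation
        self.conv = nn.Conv1d(ch_in, ch_out, kernel, dilation=dilation)

    def forward(self, x):             # x: (batch, channels, frames)
        return self.conv(F.pad(x, (self.left_pad, 0)))
```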
Affiliation(s)
- Eric W Healy: Department of Speech and Hearing Science, The Ohio State University, Columbus, Ohio 43210, USA
- Hassan Taherian: Department of Computer Science and Engineering, The Ohio State University, Columbus, Ohio 43210, USA
- Eric M Johnson: Department of Speech and Hearing Science, The Ohio State University, Columbus, Ohio 43210, USA
- DeLiang Wang: Department of Computer Science and Engineering, The Ohio State University, Columbus, Ohio 43210, USA

6. Pandey A, Wang D. Dense CNN with Self-Attention for Time-Domain Speech Enhancement. IEEE/ACM Trans Audio Speech Lang Process 2021; 29:1270-1279. [PMID: 33997107] [PMCID: PMC8118093] [DOI: 10.1109/taslp.2021.3064421]
Abstract
Speech enhancement in the time domain has become increasingly popular in recent years because of its capability to enhance both the magnitude and the phase of speech jointly. In this work, we propose a dense convolutional network (DCN) with self-attention for speech enhancement in the time domain. DCN is an encoder-decoder architecture with skip connections. Each layer in the encoder and the decoder comprises a dense block and an attention module. Dense blocks and attention modules help in feature extraction through a combination of feature reuse, increased network depth, and maximum context aggregation. Furthermore, we reveal previously unknown problems with a loss based on the spectral magnitude of enhanced speech. To alleviate these problems, we propose a novel loss based on the magnitudes of the enhanced speech and of a predicted noise signal. Even though the proposed loss is based on magnitudes only, a constraint imposed by the noise prediction ensures that the loss enhances both magnitude and phase. Experimental results demonstrate that DCN trained with the proposed loss substantially outperforms other state-of-the-art approaches to causal and non-causal speech enhancement.
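A loss in the spirit described above can be sketched as follows: match the spectral magnitude of the enhanced speech, and of the noise implied by it (mixture minus enhanced speech), to their references. The STFT settings, the L1 distance, and the equal weighting of the two terms are assumptions, not the paper's exact formulation.

```python
# Minimal sketch: magnitude loss on enhanced speech plus predicted noise.
import torch

def mag_speech_noise_loss(enhanced, clean, mixture, n_fft=512, hop=256):
    win = torch.hann_window(n_fft)
    mag = lambda x: torch.stft(x, n_fft, hop, window=win,
                               return_complex=True).abs()
    noise_est = mixture - enhanced    # noise predicted by the enhancer
    noise_ref = mixture - clean       # reference noise
    return (mag(enhanced) - mag(clean)).abs().mean() + \
           (mag(noise_est) - mag(noise_ref)).abs().mean()
```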
Affiliation(s)
- Ashutosh Pandey: Department of Computer Science and Engineering, The Ohio State University, Columbus, OH 43210, USA
- DeLiang Wang: Department of Computer Science and Engineering and the Center for Cognitive and Brain Sciences, The Ohio State University, Columbus, OH 43210, USA

7. Li A, Zheng C, Peng R, Li X. On the importance of power compression and phase estimation in monaural speech dereverberation. JASA Express Lett 2021; 1:014802. [PMID: 36154095] [DOI: 10.1121/10.0003321]
Abstract
Previous studies have shown the importance of applying power compression to both the feature and the target when only the magnitude is considered in the dereverberation task. When both real and imaginary components are estimated without power compression, it has been shown to be important to take the magnitude constraint into account. In this paper, both power compression and phase estimation are considered to show their equal importance in the dereverberation task, and we propose to reconstruct the compressed real and imaginary components (cRI) for training. Both objective and subjective results reveal that better dereverberation can be achieved when using cRI.
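The cRI target can be written down directly: compress the magnitude of the complex spectrum while leaving the phase untouched, then split the result back into real and imaginary parts. In this sketch the 0.5 exponent is an illustrative choice, not necessarily the value used in the paper.

```python
# Minimal sketch: compressed real and imaginary (cRI) components.
import numpy as np

def cri(spec, power=0.5):
    """spec: complex STFT array -> (compressed real, compressed imaginary)."""
    mag = np.abs(spec) ** power       # power-compress the magnitude
    phase = np.angle(spec)            # phase is preserved exactly
    return mag * np.cos(phase), mag * np.sin(phase)
```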
Affiliation(s)
- Andong Li, Chengshi Zheng, Renhua Peng, Xiaodong Li: Key Laboratory of Noise and Vibration Research, Institute of Acoustics, Chinese Academy of Sciences, Beijing, China