1. Ullah R, Zhang S, Asif M, Wahab F. Multimodal learning-based speech enhancement and separation, recent innovations, new horizons, challenges and real-world applications. Comput Biol Med 2025;190:110082. PMID: 40174498. DOI: 10.1016/j.compbiomed.2025.110082.
Abstract
With the increasing global prevalence of disabling hearing loss, speech enhancement technologies have become crucial for overcoming communication barriers and improving the quality of life for those affected. Multimodal learning has emerged as a powerful approach for speech enhancement and separation, integrating information from various sensory modalities such as audio signals, visual cues, and textual data. Despite substantial progress, challenges remain in synchronizing modalities, ensuring model robustness, and achieving scalability for real-time applications. This paper provides a comprehensive review of the latest advances in the most promising strategy, multimodal learning for speech enhancement and separation. We underscore the limitations of various methods in noisy and dynamic real-world environments and demonstrate how multimodal systems leverage complementary information from lip movements, text transcripts, and even brain signals to enhance performance. Critical deep learning architectures are covered, such as Transformers, Convolutional Neural Networks (CNNs), Graph Neural Networks (GNNs), and generative models like Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), and Diffusion Models. Various fusion strategies, including early and late fusion and attention mechanisms, are explored to address challenges in aligning and integrating multimodal inputs effectively. Furthermore, the paper explores important real-world applications in areas like automatic driver monitoring in autonomous vehicles, emotion recognition for mental health monitoring, augmented reality in interactive retail, smart surveillance for public safety, remote healthcare and telemedicine, and hearing assistive devices. Additionally, critical advanced procedures, comparisons, future challenges, and prospects are discussed to guide future research in multimodal learning for speech enhancement and separation, offering a roadmap for new horizons in this transformative field.
Affiliation(s)
- Rizwan Ullah: School of Mechanical Engineering, Dongguan University of Technology, Dongguan, 523808, PR China; School of Mechanical and Automotive Engineering, South China University of Technology, Guangzhou, 510640, PR China
- Shaohui Zhang: School of Mechanical Engineering, Dongguan University of Technology, Dongguan, 523808, PR China
- Muhammad Asif: Department of Electrical Engineering, Main Campus, University of Science & Technology, Bannu, 28100, Pakistan
- Fazale Wahab: Department of Electronic Engineering and Information Science, University of Science and Technology of China, Hefei, 230026, Anhui, PR China
2. Wang ZQ. SuperM2M: Supervised and mixture-to-mixture co-learning for speech enhancement and noise-robust ASR. Neural Netw 2025;188:107408. PMID: 40157231. DOI: 10.1016/j.neunet.2025.107408.
Abstract
The current dominant approach for neural speech enhancement is based on supervised learning by using simulated training data. The trained models, however, often exhibit limited generalizability to real-recorded data. To address this, this paper investigates training enhancement models directly on real target-domain data. We propose to adapt mixture-to-mixture (M2M) training, originally designed for speaker separation, for speech enhancement, by modeling multi-source noise signals as a single, combined source. In addition, we propose a co-learning algorithm that improves M2M with the help of supervised algorithms. When paired close-talk and far-field mixtures are available for training, M2M realizes speech enhancement by training a deep neural network (DNN) to produce speech and noise estimates in a way such that they can be linearly filtered to reconstruct the close-talk and far-field mixtures. This way, the DNN can be trained directly on real mixtures, and can leverage close-talk and far-field mixtures as a weak supervision to enhance far-field mixtures. To improve M2M, we combine it with supervised approaches to co-train the DNN, where mini-batches of real close-talk and far-field mixture pairs and mini-batches of simulated mixture and clean speech pairs are alternately fed to the DNN, and the loss functions are respectively (a) the mixture reconstruction loss on the real close-talk and far-field mixtures and (b) the regular enhancement loss on the simulated clean speech and noise. We find that, this way, the DNN can learn from real and simulated data to achieve better generalization to real data. We name this algorithm SuperM2M (supervised and mixture-to-mixture co-learning). Evaluation results on the CHiME-4 dataset show its effectiveness and potential.
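The mixture-to-mixture idea described above can be written down compactly. The PyTorch sketch below reduces the paper's multi-tap linear filters to a single least-squares complex gain per frequency and source, so it only illustrates the reconstruction loss; the function names, tensor shapes, and the alternation with a supervised loss are assumptions for illustration, not the authors' implementation.

```python
import torch

def per_freq_gain(est, mix, eps=1e-8):
    # Closed-form least-squares complex gain per frequency: g[f] = <mix, est> / <est, est>.
    num = (mix * est.conj()).sum(dim=-1)              # [F]
    den = (est.abs() ** 2).sum(dim=-1) + eps          # [F]
    return (num / den).unsqueeze(-1)                  # [F, 1], broadcasts over frames

def m2m_loss(speech_est, noise_est, observed_mixtures):
    """Mixture-to-mixture reconstruction loss (simplified sketch).

    speech_est, noise_est: complex STFT estimates produced by the DNN, shape [F, T].
    observed_mixtures: list of real recorded complex STFTs, e.g. the close-talk and
    far-field channels. Each source is mapped to each observed channel with a single
    complex gain per frequency (the paper uses multi-tap linear filters), and the
    loss penalizes failure to reconstruct the observed mixtures.
    """
    total = 0.0
    for y in observed_mixtures:
        y_hat = (per_freq_gain(speech_est, y) * speech_est
                 + per_freq_gain(noise_est, y) * noise_est)
        total = total + (y - y_hat).abs().mean()
    return total / len(observed_mixtures)

# SuperM2M-style co-learning then alternates mini-batches: real close-talk/far-field
# pairs use m2m_loss, while simulated (noisy, clean) pairs use a regular supervised loss.
```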
Affiliation(s)
- Zhong-Qiu Wang: Department of Computer Science and Engineering, Southern University of Science and Technology, Shenzhen, 518055, Guangdong, PR China
3. Huang K, Liu M, Ma S. Nearly Optimal Learning Using Sparse Deep ReLU Networks in Regularized Empirical Risk Minimization With Lipschitz Loss. Neural Comput 2025;37:815-870. PMID: 40030138. DOI: 10.1162/neco_a_01742.
Abstract
We propose a sparse deep ReLU network (SDRN) estimator of the regression function obtained from regularized empirical risk minimization with a Lipschitz loss function. Our framework can be applied to a variety of regression and classification problems. We establish novel nonasymptotic excess risk bounds for our SDRN estimator when the regression function belongs to a Sobolev space with mixed derivatives. We obtain a new, nearly optimal, risk rate in the sense that the SDRN estimator can achieve nearly the same optimal minimax convergence rate as one-dimensional nonparametric regression with the dimension involved in a logarithm term only when the feature dimension is fixed. The estimator has a slightly slower rate when the dimension grows with the sample size. We show that the depth of the SDRN estimator grows with the sample size in logarithmic order, and the total number of nodes and weights grows in polynomial order of the sample size to have the nearly optimal risk rate. The proposed SDRN can go deeper with fewer parameters to well estimate the regression and overcome the overfitting problem encountered by conventional feedforward neural networks.
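The estimator described above is an instance of penalized empirical risk minimization; a generic form, with notation chosen here rather than taken from the paper, is

```latex
\hat{f}_n \;=\; \operatorname*{arg\,min}_{f \in \mathcal{F}_{\mathrm{SDRN}}}
\;\frac{1}{n}\sum_{i=1}^{n} \rho\bigl(y_i, f(x_i)\bigr) \;+\; \lambda_n\, J(f),
\qquad \text{with } \bigl|\rho(y,u)-\rho(y,v)\bigr| \le L\,|u-v|,
```

where \(\mathcal{F}_{\mathrm{SDRN}}\) is the class of sparse deep ReLU networks, \(\rho\) is a Lipschitz loss (e.g., hinge, quantile, or Huber), \(J(f)\) is a sparsity-inducing penalty, and \(\lambda_n\) is a tuning parameter; the paper's exact penalty and network-class constraints may differ.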
Affiliation(s)
- Ke Huang: Department of Statistics, University of California, Riverside, Riverside 92521, CA, U.S.A.
- Mingming Liu: Department of Statistics, University of California, Riverside, Riverside 92521, CA, U.S.A.
- Shujie Ma: Department of Statistics, University of California, Riverside, Riverside 92521, CA, U.S.A.
4. Im J, Pak S, Woo SY, Shin W, Lee ST. Flash Memory for Synaptic Plasticity in Neuromorphic Computing: A Review. Biomimetics (Basel) 2025;10:121. PMID: 39997144. PMCID: PMC11852767. DOI: 10.3390/biomimetics10020121.
Abstract
The rapid expansion of data has made global access easier, but it also demands increasing amounts of energy for data storage and processing. In response, neuromorphic electronics, inspired by the functionality of biological neurons and synapses, have emerged as a growing area of research. These devices enable in-memory computing, helping to overcome the "von Neumann bottleneck", a limitation caused by the separation of memory and processing units in traditional von Neumann architecture. By leveraging multi-bit non-volatility, biologically inspired features, and Ohm's law, synaptic devices show great potential for reducing energy consumption in multiplication and accumulation operations. Within the various non-volatile memory technologies available, flash memory stands out as a highly competitive option for storing large volumes of data. This review highlights recent advancements in neuromorphic computing that utilize NOR, AND, and NAND flash memory. This review also delves into the array architecture, operational methods, and electrical properties of NOR, AND, and NAND flash memory, emphasizing its application in different neural network designs. By providing a detailed overview of flash memory-based neuromorphic computing, this review offers valuable insights into optimizing its use across diverse applications.
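The Ohm's-law argument for in-memory multiply-accumulate can be made concrete with a few lines of NumPy; the conductance range, read voltage, and weight mapping below are illustrative values, not device parameters from the review.

```python
import numpy as np

def flash_array_mac(weights, inputs, g_min=1e-6, g_max=1e-4, v_read=0.1):
    """In-memory multiply-accumulate on a flash synaptic array (illustrative only).

    Weights in [0, 1] are mapped to cell conductances in [g_min, g_max] siemens,
    and inputs in [0, 1] scale the read voltage applied to each input line.
    By Ohm's law each cell contributes I = G * V, and Kirchhoff's current law
    sums the currents along every bit line, yielding a vector-matrix product.
    """
    G = g_min + weights * (g_max - g_min)      # [out, in] conductance matrix
    V = inputs * v_read                        # [in] applied read voltages
    return G @ V                               # summed bit-line currents, shape [out]

# Example: a 3x4 synaptic array driven by a 4-element input vector
currents = flash_array_mac(np.random.rand(3, 4), np.array([1.0, 0.0, 0.5, 1.0]))
```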
Affiliation(s)
- Jisung Im: School of Electronic and Electrical Engineering, Kyungpook National University, Daegu 41566, Republic of Korea
- Sangyeon Pak: School of Electronic and Electrical Engineering, Hongik University, Seoul 04066, Republic of Korea
- Sung-Yun Woo: School of Electronic and Electrical Engineering, Kyungpook National University, Daegu 41566, Republic of Korea
- Wonjun Shin: Department of Semiconductor Convergence Engineering, Sungkyunkwan University, Suwon 16419, Republic of Korea
- Sung-Tae Lee: School of Electronic and Electrical Engineering, Hongik University, Seoul 04066, Republic of Korea
5. Vinotha R, Hepsiba D, Vijay Anand LD, Andrew J, Jennifer Eunice R. Enhancing dysarthric speech recognition through SepFormer and hierarchical attention network models with multistage transfer learning. Sci Rep 2024;14:29455. PMID: 39604526. PMCID: PMC11603152. DOI: 10.1038/s41598-024-80764-w.
Abstract
Dysarthria, a motor speech disorder that impacts articulation and speech clarity, presents significant challenges for Automatic Speech Recognition (ASR) systems. This study proposes a groundbreaking approach to enhance the accuracy of Dysarthric Speech Recognition (DSR). A primary innovation lies in the integration of the SepFormer-Speech Enhancement Generative Adversarial Network (S-SEGAN), an advanced generative adversarial network tailored for Dysarthric Speech Enhancement (DSE), as a front-end processing stage for DSR systems. The S-SEGAN integrates SEGAN's adversarial learning with SepFormer speech separation capabilities, demonstrating significant improvements in performance. Furthermore, a multistage transfer learning approach is employed to assess the DSR models for both word-level and sentence-level DSR. These DSR models are first trained on a large speech dataset (LibriSpeech) and then fine-tuned on dysarthric speech data (both isolated and augmented). Evaluations demonstrate significant DSR accuracy improvements in DSE integration. The Dysarthric Speech (DS)-baseline models (without DSE), Transformer and Conformer achieved Word Recognition Accuracy (WRA) percentages of 68.60% and 69.87%, respectively. The introduction of Hierarchical Attention Network (HAN) with the Transformer and Conformer architectures resulted in improved performance, with T-HAN achieving a WRA of 71.07% and C-HAN reaching 73%. The Transformer model with DSE + DSR for isolated words achieves a WRA of 73.40%, while that of the Conformer model reaches 74.33%. Notably, the T-HAN and C-HAN models with DSE + DSR demonstrate even more substantial enhancements, with WRAs of 75.73% and 76.87%, respectively. Augmenting words further boosts model performance, with the Transformer and Conformer models achieving WRAs of 76.47% and 79.20%, respectively. Remarkably, the T-HAN and C-HAN models with DSE + DSR and augmented words exhibit WRAs of 82.13% and 84.07%, respectively, with C-HAN displaying the highest performance among all proposed models.
Affiliation(s)
- R Vinotha: Division of Robotics Engineering, Karunya Institute of Technology and Sciences, Coimbatore, Tamil Nadu, India
- D Hepsiba: Division of Biomedical Engineering, Karunya Institute of Technology and Sciences, Coimbatore, Tamil Nadu, India
- L D Vijay Anand: Division of Robotics Engineering, Karunya Institute of Technology and Sciences, Coimbatore, Tamil Nadu, India
- J Andrew: Department of Computer Science and Engineering, Manipal Institute of Technology, Manipal Academy of Higher Education, Manipal, Karnataka, India
- R Jennifer Eunice: Department of Mechatronics Engineering, Manipal Institute of Technology, Manipal Academy of Higher Education, Manipal, Karnataka, India
6. Henry F, Glavin M, Jones E, Parsi A. Impact of Mask Type as Training Target for Speech Intelligibility and Quality in Cochlear-Implant Noise Reduction. Sensors (Basel) 2024;24:6614. PMID: 39460094. PMCID: PMC11511210. DOI: 10.3390/s24206614.
Abstract
The selection of a target when training deep neural networks for speech enhancement is an important consideration. Different masks have been shown to exhibit different performance characteristics depending on the application and the conditions. This paper presents a comprehensive comparison of several different masks for noise reduction in cochlear implants. The study incorporated three well-known masks, namely the Ideal Binary Mask (IBM), Ideal Ratio Mask (IRM) and the Fast Fourier Transform Mask (FFTM), as well as two newly proposed masks, based on existing masks, called the Quantized Mask (QM) and the Phase-Sensitive plus Ideal Ratio Mask (PSM+). These five masks are used to train networks to estimate masks for the purpose of separating speech from noisy mixtures. A vocoder was used to simulate the behavior of a cochlear implant. Short-time Objective Intelligibility (STOI) and Perceptual Evaluation of Speech Quality (PESQ) scores indicate that the two new masks proposed in this study (QM and PSM+) perform best for normal speech intelligibility and quality in the presence of stationary and non-stationary noise over a range of signal-to-noise ratios (SNRs). The Normalized Covariance Measure (NCM) and similarity scores indicate that they also perform best for speech intelligibility/gauging the similarity of vocoded speech. The Quantized Mask performs better than the Ideal Binary Mask due to its better resolution as it approximates the Wiener Gain Function. The PSM+ performs better than the three existing benchmark masks (IBM, IRM, and FFTM) as it incorporates both magnitude and phase information.
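For reference, the standard definitions of these training targets can be computed directly from the clean-speech and noise STFTs. In the sketch below the quantized mask and PSM+ are plausible readings of the descriptions in the abstract (a few-level quantization of the Wiener-like gain, and a phase-sensitive mask clipped to [0, 1]); the exact variants used in the study may differ.

```python
import numpy as np

def training_masks(S, N, lc_db=-5.0, n_levels=4):
    """Common T-F mask targets from clean-speech and noise STFTs S, N (complex, [F, T])."""
    Y = S + N                                    # noisy mixture
    ps, pn = np.abs(S) ** 2, np.abs(N) ** 2
    snr_db = 10.0 * np.log10((ps + 1e-12) / (pn + 1e-12))

    ibm = (snr_db > lc_db).astype(float)                        # Ideal Binary Mask
    irm = np.sqrt(ps / (ps + pn + 1e-12))                       # Ideal Ratio Mask
    fftm = np.minimum(np.abs(S) / (np.abs(Y) + 1e-12), 1.0)     # FFT (magnitude) mask
    wiener = ps / (ps + pn + 1e-12)
    qm = np.round(wiener * (n_levels - 1)) / (n_levels - 1)     # quantized Wiener-like gain (sketch)
    psm = np.abs(S) / (np.abs(Y) + 1e-12) * np.cos(np.angle(S) - np.angle(Y))
    psm_plus = np.clip(psm, 0.0, 1.0)                           # phase-sensitive mask, clipped
    return ibm, irm, fftm, qm, psm_plus
```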
Affiliation(s)
- Fergal Henry: Department of Computing and Electronic Engineering, Atlantic Technological University, Ash Lane, F91 YW50 Sligo, Ireland
- Martin Glavin: Electrical and Electronic Engineering, University of Galway, University Road, H91 TK33 Galway, Ireland
- Edward Jones: Electrical and Electronic Engineering, University of Galway, University Road, H91 TK33 Galway, Ireland
- Ashkan Parsi: Electrical and Electronic Engineering, University of Galway, University Road, H91 TK33 Galway, Ireland
7. Han JY, Li JH, Yang CS, Chen F, Liao WH, Liao YF, Lai YH. Leveraging Deep Learning to Enhance Optical Microphone System Performance with Unknown Speakers for Cochlear Implants. Annu Int Conf IEEE Eng Med Biol Soc 2024;2024:1-6. PMID: 40039183. DOI: 10.1109/embc53108.2024.10782084.
Abstract
Cochlear implants (CI) play a crucial role in restoring hearing for individuals with profound-to-severe hearing loss. However, challenges persist, particularly in low signal-to-noise ratio and distant-talk scenarios. This study introduces an innovative solution by integrating a Laser Doppler vibrometer (LDV) with deep learning to reconstruct clean speech from unknown speakers in noisy conditions. Objective evaluations, including short-time objective intelligibility (STOI) and perceptual evaluation of speech quality (PESQ), demonstrate the superior performance of the proposed-LDV system over traditional microphones and a baseline LDV system under the same recording conditions. STOI scores for Mic-Noisy, Mic-log Minimum Mean Square Error (logMMSE), baseline-LDV, and proposed-LDV were 0.44, 0.35, 0.48, and 0.73, respectively, whereas PESQ scores were 1.51, 1.76, 1.4, 0.73, and 1.96, respectively. Furthermore, vocoder-simulation listening test results showed the proposed system achieving a higher word accuracy score than the baseline systems. These findings highlight the potential of the proposed system as a robust speech capture method for CI users, addressing challenges related to noise and distance.
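Objective scores of this kind are straightforward to reproduce on one's own recordings. The sketch below assumes the third-party pystoi and pesq packages and a 16 kHz sampling rate, and is not tied to this study's evaluation pipeline.

```python
import soundfile as sf
from pystoi import stoi          # pip install pystoi
from pesq import pesq            # pip install pesq

def evaluate_pair(clean_path, processed_path, fs_target=16000):
    """Score a processed recording against its clean reference with STOI and wide-band PESQ."""
    clean, fs1 = sf.read(clean_path)
    proc, fs2 = sf.read(processed_path)
    assert fs1 == fs2 == fs_target, "resample both signals to a common rate first"
    n = min(len(clean), len(proc))               # align lengths before scoring
    clean, proc = clean[:n], proc[:n]
    return {
        "STOI": stoi(clean, proc, fs_target, extended=False),
        "PESQ": pesq(fs_target, clean, proc, "wb"),   # ITU-T P.862.2 wide-band PESQ
    }
```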
8. Lee GW, Kim HK. Cluster-Based Pairwise Contrastive Loss for Noise-Robust Speech Recognition. Sensors (Basel) 2024;24:2573. PMID: 38676191. PMCID: PMC11054889. DOI: 10.3390/s24082573.
Abstract
This paper addresses a joint training approach applied to a pipeline comprising speech enhancement (SE) and automatic speech recognition (ASR) models, where an acoustic tokenizer is included in the pipeline to leverage the linguistic information from the ASR model to the SE model. The acoustic tokenizer takes the outputs of the ASR encoder and provides a pseudo-label through K-means clustering. To transfer the linguistic information, represented by pseudo-labels, from the acoustic tokenizer to the SE model, a cluster-based pairwise contrastive (CBPC) loss function is proposed, which is a self-supervised contrastive loss function, and combined with an information noise contrastive estimation (infoNCE) loss function. This combined loss function prevents the SE model from overfitting to outlier samples and represents the pronunciation variability in samples with the same pseudo-label. The effectiveness of the proposed CBPC loss function is evaluated on a noisy LibriSpeech dataset by measuring both the speech quality scores and the word error rate (WER). The experimental results reveal that the proposed joint training approach using the described CBPC loss function achieves a lower WER than the conventional joint training approaches. In addition, it is demonstrated that the speech quality scores of the SE model trained using the proposed training approach are higher than those of the standalone-SE model and SE models trained using conventional joint training approaches. An ablation study is also conducted to investigate the effects of different combinations of loss functions on the speech quality scores and WER. Here, it is revealed that the proposed CBPC loss function combined with infoNCE contributes to a reduced WER and an increase in most of the speech quality scores.
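A minimal way to realize a cluster-driven contrastive objective is sketched below; it treats frames that share a K-means pseudo-label as positives in an infoNCE-style loss. Shapes, the temperature, and the positive/negative bookkeeping are assumptions for illustration, not the paper's exact CBPC formulation.

```python
import torch
import torch.nn.functional as F

def cluster_contrastive_loss(embeddings, pseudo_labels, temperature=0.1):
    """Cluster-based pairwise contrastive loss (illustrative sketch).

    embeddings:    [B, D] SE-branch embeddings for a mini-batch of frames/utterances.
    pseudo_labels: [B] K-means cluster indices derived from ASR-encoder outputs.
    Embeddings sharing a pseudo-label are pulled together; all other items in the
    batch act as negatives, via an infoNCE-style normalized cross-entropy.
    """
    z = F.normalize(embeddings, dim=-1)
    logits = z @ z.t() / temperature                             # [B, B] cosine-similarity logits
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    logits = logits.masked_fill(self_mask, float("-inf"))        # exclude self-comparisons
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    positives = (pseudo_labels.unsqueeze(0) == pseudo_labels.unsqueeze(1)) & ~self_mask
    pos_log_prob = log_prob.masked_fill(~positives, 0.0).sum(dim=1)
    n_pos = positives.sum(dim=1)
    valid = n_pos > 0                                            # anchors with at least one positive
    return -(pos_log_prob[valid] / n_pos[valid]).mean()
```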
Affiliation(s)
- Geon Woo Lee: AI Graduate School, Gwangju Institute of Science and Technology, Gwangju 61005, Republic of Korea
- Hong Kook Kim: AI Graduate School, Gwangju Institute of Science and Technology, Gwangju 61005, Republic of Korea; School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology, Gwangju 61005, Republic of Korea; AunionAI Co., Ltd., Gwangju 61005, Republic of Korea
9. Fan J, Williamson DS. From the perspective of perceptual speech quality: The robustness of frequency bands to noise. J Acoust Soc Am 2024;155:1916-1927. PMID: 38456734. DOI: 10.1121/10.0025272.
Abstract
Speech quality is one of the main foci of speech-related research, where it is frequently studied with speech intelligibility, another essential measurement. Band-level perceptual speech intelligibility, however, has been studied frequently, whereas speech quality has not been thoroughly analyzed. In this paper, a Multiple Stimuli With Hidden Reference and Anchor (MUSHRA) inspired approach was proposed to study the individual robustness of frequency bands to noise with perceptual speech quality as the measure. Speech signals were filtered into thirty-two frequency bands with compromising real-world noise employed at different signal-to-noise ratios. Robustness to noise indices of individual frequency bands was calculated based on the human-rated perceptual quality scores assigned to the reconstructed noisy speech signals. Trends in the results suggest the mid-frequency region appeared less robust to noise in terms of perceptual speech quality. These findings suggest future research aiming at improving speech quality should pay more attention to the mid-frequency region of the speech signals accordingly.
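The band-wise corruption procedure can be approximated with standard tools. The sketch below uses a Butterworth band-pass filter and a simple power-based SNR scaling; the study's 32-band analysis/reconstruction and its listening protocol involve more than this.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def band_filter(signal, fs, f_lo, f_hi, order=4):
    """Extract one frequency band with a zero-phase Butterworth band-pass filter."""
    sos = butter(order, [f_lo, f_hi], btype="bandpass", fs=fs, output="sos")
    return sosfiltfilt(sos, signal)

def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so the speech-to-noise power ratio equals snr_db, then mix.

    Assumes the noise recording is at least as long as the speech signal.
    """
    noise = noise[: len(speech)]
    p_s = np.mean(speech ** 2)
    p_n = np.mean(noise ** 2) + 1e-12
    noise = noise * np.sqrt(p_s / (p_n * 10 ** (snr_db / 10.0)))
    return speech + noise

# e.g., corrupt a single band and leave the rest of the signal intact:
# band = band_filter(clean, fs, 1000, 1250); noisy_band = mix_at_snr(band, babble, snr_db=0)
```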
Affiliation(s)
- Junyi Fan: Department of Computer Science and Engineering, The Ohio State University, Columbus, Ohio 43210, USA
- Donald S Williamson: Department of Computer Science and Engineering, The Ohio State University, Columbus, Ohio 43210, USA
10. Cherukuru P, Mustafa MB. CNN-based noise reduction for multi-channel speech enhancement system with discrete wavelet transform (DWT) preprocessing. PeerJ Comput Sci 2024;10:e1901. PMID: 38435554. PMCID: PMC10909157. DOI: 10.7717/peerj-cs.1901.
Abstract
Speech enhancement algorithms are applied at multiple stages of multi-channel speech enhancement (MCSE) systems to improve the quality of speech signals in noisy environments. Numerous existing algorithms are used to filter noise in speech enhancement systems, typically as a pre-processor that reduces noise and improves speech quality. They may, however, perform poorly in low signal-to-noise ratio (SNR) situations, and speech devices are exposed to many kinds of environmental noise, including high-frequency noise. The objective of this research is to conduct a noise reduction experiment for an MCSE system under stationary and non-stationary environmental noise with varying speech signal SNR levels. The experiments examined how well the existing and the proposed MCSE systems filter environmental noise across low to high SNRs (-10 dB to 20 dB), using the AURORA and LibriSpeech datasets, which contain different types of environmental noise. The existing MCSE (BAV-MCSE) uses beamforming, adaptive noise reduction, and voice activity detection algorithms (BAV) to filter noise from speech signals. The proposed MCSE (DWT-CNN-MCSE) system applies discrete wavelet transform (DWT) preprocessing and a convolutional neural network (CNN) to denoise the input noisy speech signals. The performance of the existing BAV-MCSE and the proposed DWT-CNN-MCSE was measured using spectrogram analysis and word recognition rate (WRR). The existing BAV-MCSE reported the highest WRR of 93.77% at a high SNR (20 dB) but only 5.64% on average at a low SNR (-10 dB) across different noises. The proposed DWT-CNN-MCSE system performed well at low SNR, with a WRR of 70.55% and the largest improvement (64.91 percentage points in WRR) at -10 dB SNR.
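The DWT pre-processing stage can be illustrated with PyWavelets. In the sketch below a universal-threshold soft shrinkage stands in for the CNN denoiser so the example stays self-contained; the wavelet choice and decomposition level are assumptions.

```python
import numpy as np
import pywt  # PyWavelets

def dwt_denoise(noisy, wavelet="db4", level=3):
    """DWT pre-processing sketch: decompose, soft-threshold the detail coefficients, reconstruct.

    In the described system the wavelet representation feeds a CNN; here a simple
    universal-threshold shrinkage replaces that network for illustration.
    """
    coeffs = pywt.wavedec(noisy, wavelet, level=level)
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745            # noise estimate from finest scale
    thresh = sigma * np.sqrt(2.0 * np.log(len(noisy)))        # universal threshold
    denoised = [coeffs[0]] + [pywt.threshold(c, thresh, mode="soft") for c in coeffs[1:]]
    return pywt.waverec(denoised, wavelet)[: len(noisy)]
```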
Affiliation(s)
- Pavani Cherukuru: Department of Software Engineering, Faculty of Computer Science and Information Technology, Universiti Malaya, Kuala Lumpur, Malaysia; Department of Information Science, Dayananda Sagar Academy of Technology and Management, Bangalore, Karnataka, India
- Mumtaz Begum Mustafa: Department of Software Engineering, Faculty of Computer Science and Information Technology, Universiti Malaya, Kuala Lumpur, Malaysia
11. Kim S, Athi M, Shi G, Kim M, Kristjansson T. Zero-shot test time adaptation via knowledge distillation for personalized speech denoising and dereverberation. J Acoust Soc Am 2024;155:1353-1367. PMID: 38364043. DOI: 10.1121/10.0024621.
Abstract
A personalization framework to adapt compact models to test time environments and improve their speech enhancement (SE) performance in noisy and reverberant conditions is proposed. The use-cases are when the end-user device encounters only one or a few speakers and noise types that tend to reoccur in the specific acoustic environment. Hence, a small personalized model that is sufficient to handle this focused subset of the original universal SE problem is postulated. The study addresses a major data shortage issue: although the goal is to learn from a specific user's speech signals and the test time environment, the target clean speech is unavailable for model training due to privacy-related concerns and technical difficulty of recording noise and reverberation-free voice signals. The proposed zero-shot personalization method uses no clean speech target. Instead, it employs the knowledge distillation framework, where the more advanced denoising results from an overly large teacher work as pseudo targets to train a small student model. Evaluation on various test time conditions suggests that the proposed personalization approach can significantly enhance the compact student model's test time performance. Personalized models outperform larger non-personalized baseline models, demonstrating that personalization achieves model compression with no loss in dereverberation and denoising performance.
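The zero-shot adaptation loop reduces to a short training script. The sketch below assumes generic teacher/student PyTorch modules and an L1 loss on the teacher's pseudo-targets, which simplifies the paper's distillation setup.

```python
import torch

def personalize_student(student, teacher, noisy_loader, epochs=5, lr=1e-4):
    """Zero-shot test-time adaptation sketch: the large teacher's enhanced outputs
    serve as pseudo-targets for the compact student; no clean speech is needed."""
    teacher.eval()
    opt = torch.optim.Adam(student.parameters(), lr=lr)
    for _ in range(epochs):
        for noisy in noisy_loader:              # user-recorded noisy/reverberant audio only
            with torch.no_grad():
                pseudo_target = teacher(noisy)  # teacher's denoised/dereverberated estimate
            opt.zero_grad()
            loss = torch.nn.functional.l1_loss(student(noisy), pseudo_target)
            loss.backward()
            opt.step()
    return student
```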
Affiliation(s)
- Sunwoo Kim: Amazon Lab126, Sunnyvale, California 94089, USA
- Guangji Shi: Amazon Lab126, Sunnyvale, California 94089, USA
- Minje Kim: Amazon Lab126, Sunnyvale, California 94089, USA; Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, USA
12. Wang D, Wang J, Sun M. 3 directional Inception-ResUNet: Deep spatial feature learning for multichannel singing voice separation with distortion. PLoS One 2024;19:e0289453. PMID: 38285654. PMCID: PMC10824424. DOI: 10.1371/journal.pone.0289453.
Abstract
Singing voice separation on robots faces the problem of interpreting ambiguous auditory signals. The acoustic signal, which the humanoid robot perceives through its onboard microphones, is a mixture of singing voice, music, and noise, with distortion, attenuation, and reverberation. In this paper, we used the 3D Inception-ResUNet structure in the U-shaped encoding and decoding network to improve the utilization of the spatial and spectral information of the spectrogram. Multiobjectives were used to train the model: magnitude consistency loss, phase consistency loss, and magnitude correlation consistency loss. We recorded the singing voice and accompaniment derived from the MIR-1K dataset with NAO robots and synthesized the 10-channel dataset for training the model. The experimental results show that the proposed model trained by multiple objectives reaches an average NSDR of 11.55 dB on the test dataset, which outperforms the comparison model.
Affiliation(s)
- DaDong Wang: School of Mathematics and Computer Science, Jilin Normal University, Siping, Jilin, China
- Jie Wang: School of Mathematics and Computer Science, Jilin Normal University, Siping, Jilin, China
- MingChen Sun: School of Computer Science and Technology, Jilin University, Changchun, Jilin, China
13. Schuller BW, Akman A, Chang Y, Coppock H, Gebhard A, Kathan A, Rituerto-González E, Triantafyllopoulos A, Pokorny FB. Ecology & computer audition: Applications of audio technology to monitor organisms and environment. Heliyon 2024;10:e23142. PMID: 38163154. PMCID: PMC10755287. DOI: 10.1016/j.heliyon.2023.e23142.
Abstract
Among the 17 Sustainable Development Goals (SDGs) proposed within the 2030 Agenda and adopted by all the United Nations member states, the 13th SDG is a call for action to combat climate change. Moreover, SDGs 14 and 15 claim the protection and conservation of life below water and life on land, respectively. In this work, we provide a literature-founded overview of application areas, in which computer audition - a powerful but in this context so far hardly considered technology, combining audio signal processing and machine intelligence - is employed to monitor our ecosystem with the potential to identify ecologically critical processes or states. We distinguish between applications related to organisms, such as species richness analysis and plant health monitoring, and applications related to the environment, such as melting ice monitoring or wildfire detection. This work positions computer audition in relation to alternative approaches by discussing methodological strengths and limitations, as well as ethical aspects. We conclude with an urgent call to action to the research community for a greater involvement of audio intelligence methodology in future ecosystem monitoring approaches.
Affiliation(s)
- Björn W. Schuller: GLAM – Group on Language, Audio, & Music, Imperial College London, UK; EIHW – Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg, Germany; audEERING GmbH, Gilching, Germany
- Alican Akman: GLAM – Group on Language, Audio, & Music, Imperial College London, UK
- Yi Chang: GLAM – Group on Language, Audio, & Music, Imperial College London, UK
- Harry Coppock: GLAM – Group on Language, Audio, & Music, Imperial College London, UK
- Alexander Gebhard: EIHW – Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg, Germany
- Alexander Kathan: EIHW – Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg, Germany
- Esther Rituerto-González: EIHW – Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg, Germany; GPM – Group of Multimedia Processing, University Carlos III of Madrid, Spain
- Florian B. Pokorny: EIHW – Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg, Germany; Division of Phoniatrics, Medical University of Graz, Austria
14. Luo X, Ke Y, Li X, Zheng C. On phase recovery and preserving early reflections for deep-learning speech dereverberation. J Acoust Soc Am 2024;155:436-451. PMID: 38240664. DOI: 10.1121/10.0024348.
Abstract
In indoor environments, reverberation often distorts clean speech. Although deep learning-based speech dereverberation approaches have shown much better performance than traditional ones, the inferior speech quality of the dereverberated speech caused by magnitude distortion and limited phase recovery is still a serious problem for practical applications. This paper improves the performance of deep learning-based speech dereverberation from the perspectives of both network design and mapping target optimization. Specifically, on the one hand, a bifurcated-and-fusion network and its guidance loss functions were designed to help reduce the magnitude distortion while enhancing the phase recovery. On the other hand, the time boundary between the early and late reflections in the mapped speech was investigated, so as to make a balance between the reverberation tailing effect and the difficulty of magnitude/phase recovery. Mathematical derivations were provided to show the rationality of the specially designed loss functions. Geometric illustrations were given to explain the importance of preserving early reflections in reducing the difficulty of phase recovery. Ablation study results confirmed the validity of the proposed network topology and the importance of preserving 20 ms early reflections in the mapped speech. Objective and subjective test results showed that the proposed system outperformed other baselines in the speech dereverberation task.
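Constructing a mapping target that preserves early reflections amounts to truncating the room impulse response before convolution. The 20 ms boundary in the sketch below follows the ablation result quoted in the abstract, while the peak-picking details are illustrative.

```python
import numpy as np
from scipy.signal import fftconvolve

def early_reflection_target(clean, rir, fs, early_ms=20.0):
    """Build the dereverberation target by keeping only the direct path plus the
    first `early_ms` milliseconds of reflections in the room impulse response."""
    direct = int(np.argmax(np.abs(rir)))                 # direct-path peak index
    cutoff = direct + int(early_ms * 1e-3 * fs)          # keep direct + early reflections
    rir_early = rir.copy()
    rir_early[cutoff:] = 0.0
    target = fftconvolve(clean, rir_early)[: len(clean)]           # mapping target
    reverberant_input = fftconvolve(clean, rir)[: len(clean)]      # network input
    return reverberant_input, target
```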
Affiliation(s)
- Xiaoxue Luo: Key Laboratory of Noise and Vibration Research, Institute of Acoustics, Chinese Academy of Sciences, 100190, Beijing, China
- Yuxuan Ke: Key Laboratory of Noise and Vibration Research, Institute of Acoustics, Chinese Academy of Sciences, 100190, Beijing, China
- Xiaodong Li: Key Laboratory of Noise and Vibration Research, Institute of Acoustics, Chinese Academy of Sciences, 100190, Beijing, China
- Chengshi Zheng: Key Laboratory of Noise and Vibration Research, Institute of Acoustics, Chinese Academy of Sciences, 100190, Beijing, China
15. Ha J, Baek SC, Lim Y, Chung JH. Validation of cost-efficient EEG experimental setup for neural tracking in an auditory attention task. Sci Rep 2023;13:22682. PMID: 38114579. PMCID: PMC10730561. DOI: 10.1038/s41598-023-49990-6.
Abstract
When individuals listen to speech, their neural activity phase-locks to the slow temporal rhythm, which is commonly referred to as "neural tracking". The neural tracking mechanism allows for the detection of an attended sound source in a multi-talker situation by decoding neural signals obtained by electroencephalography (EEG), known as auditory attention decoding (AAD). Neural tracking with AAD can be utilized as an objective measurement tool for diverse clinical contexts, and it has potential to be applied to neuro-steered hearing devices. To effectively utilize this technology, it is essential to enhance the accessibility of EEG experimental setup and analysis. The aim of the study was to develop a cost-efficient neural tracking system and validate the feasibility of neural tracking measurement by conducting an AAD task using an offline and real-time decoder model outside the soundproof environment. We devised a neural tracking system capable of conducting AAD experiments using an OpenBCI and Arduino board. Nine participants were recruited to assess the performance of the AAD using the developed system, which involved presenting competing speech signals in an experiment setting without soundproofing. As a result, the offline decoder model demonstrated an average performance of 90%, and real-time decoder model exhibited a performance of 78%. The present study demonstrates the feasibility of implementing neural tracking and AAD using cost-effective devices in a practical environment.
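The offline decoder in such AAD setups is commonly a backward (stimulus-reconstruction) model. The sketch below uses ridge regression on time-lagged EEG and a correlation comparison between the two talkers' envelopes; lag handling and cross-validation are simplified, and this is not the authors' exact pipeline.

```python
import numpy as np
from sklearn.linear_model import Ridge

def lagged(eeg, n_lags):
    """Stack time-lagged copies of the EEG (samples x channels) as decoder features.

    np.roll wraps around at the edges; proper padding is omitted for brevity.
    """
    return np.concatenate([np.roll(eeg, lag, axis=0) for lag in range(n_lags)], axis=1)

def train_decoder(eeg_train, attended_env_train, n_lags=32, alpha=1.0):
    """Fit a backward (stimulus-reconstruction) decoder from lagged EEG to the attended envelope."""
    return Ridge(alpha=alpha).fit(lagged(eeg_train, n_lags), attended_env_train)

def decode_attention(decoder, eeg_test, env_a, env_b, n_lags=32):
    """Pick the talker whose speech envelope correlates best with the reconstructed envelope."""
    recon = decoder.predict(lagged(eeg_test, n_lags))
    r_a = np.corrcoef(recon, env_a)[0, 1]
    r_b = np.corrcoef(recon, env_b)[0, 1]
    return "A" if r_a > r_b else "B"
```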
Affiliation(s)
- Jiyeon Ha: Department of HY-KIST Bio-Convergence, Hanyang University, Seoul, 04763, Korea; Center for Intelligent & Interactive Robotics, Artificial Intelligence and Robot Institute, Korea Institute of Science and Technology, Seoul, 02792, Korea
- Seung-Cheol Baek: Center for Intelligent & Interactive Robotics, Artificial Intelligence and Robot Institute, Korea Institute of Science and Technology, Seoul, 02792, Korea; Research Group Neurocognition of Music and Language, Max Planck Institute for Empirical Aesthetics, 60322, Frankfurt am Main, Germany
- Yoonseob Lim: Department of HY-KIST Bio-Convergence, Hanyang University, Seoul, 04763, Korea; Center for Intelligent & Interactive Robotics, Artificial Intelligence and Robot Institute, Korea Institute of Science and Technology, Seoul, 02792, Korea
- Jae Ho Chung: Department of HY-KIST Bio-Convergence, Hanyang University, Seoul, 04763, Korea; Center for Intelligent & Interactive Robotics, Artificial Intelligence and Robot Institute, Korea Institute of Science and Technology, Seoul, 02792, Korea; Department of Otolaryngology-Head and Neck Surgery, College of Medicine, Hanyang University, Seoul, 04763, Korea; Department of Otolaryngology-Head and Neck Surgery, School of Medicine, Hanyang University, 222-Wangshimni-ro, Seongdong-gu, Seoul, 133-792, Korea
16. Lin YM, Han JY, Lin CH, Lai YH. Optical Microphone-Based Speech Reconstruction System With Deep Learning for Individuals With Hearing Loss. IEEE Trans Biomed Eng 2023;70:3330-3341. PMID: 37327105. DOI: 10.1109/tbme.2023.3285437.
Abstract
OBJECTIVE: Although many speech enhancement (SE) algorithms have been proposed to promote speech perception in hearing-impaired patients, conventional SE approaches that perform well under quiet and/or stationary noise fail under nonstationary noise and/or when the speaker is at a considerable distance. The objective of this study is to overcome these limitations of conventional speech enhancement approaches.
METHOD: This study proposes a speaker-closed, deep learning-based SE method together with an optical microphone to acquire and enhance the speech of a target speaker.
RESULTS: The objective evaluation scores achieved by the proposed method outperformed the baseline methods by margins of 0.21-0.27 in speech quality (HASQI) and 0.34-0.64 in speech comprehension/intelligibility (HASPI) for seven typical hearing loss types.
CONCLUSION: The results suggest that the proposed method can enhance speech perception by removing noise from speech signals and mitigating interference caused by distance.
SIGNIFICANCE: The results show a practical way to improve the listening experience by enhancing speech quality and speech comprehension/intelligibility for hearing-impaired people.
17. Fan C, Zhang H, Li A, Xiang W, Zheng C, Lv Z, Wu X. CompNet: Complementary network for single-channel speech enhancement. Neural Netw 2023;168:508-517. PMID: 37832318. DOI: 10.1016/j.neunet.2023.09.041.
Abstract
Recent multi-domain processing methods have demonstrated promising performance on monaural speech enhancement tasks. However, few of them explain why they perform better than single-domain approaches. As an attempt to fill this gap, this paper presents a complementary single-channel speech enhancement network (CompNet) that demonstrates promising denoising capabilities and provides a unique perspective on the improvements introduced by multi-domain processing. Specifically, the noisy speech is initially enhanced through a time-domain network. However, although the waveform can be feasibly recovered, the distribution of the time-frequency bins may still differ in part from the target spectrum when the problem is reconsidered in the frequency domain. To solve this problem, we design a dedicated dual-path network as a post-processing module to independently filter the magnitude and refine the phase. This further drives the estimated spectrum to closely approximate the target spectrum in the time-frequency domain. We conduct extensive experiments with the WSJ0-SI84 and VoiceBank + Demand datasets. Objective test results show that the performance of the proposed system is highly competitive with existing systems.
Affiliation(s)
- Cunhang Fan: Anhui Province Key Laboratory of Multimodal Cognitive Computation, School of Computer Science and Technology, Anhui University, Hefei 230601, China
- Hongmei Zhang: Anhui Province Key Laboratory of Multimodal Cognitive Computation, School of Computer Science and Technology, Anhui University, Hefei 230601, China
- Andong Li: Key Laboratory of Noise and Vibration Research, Institute of Acoustics, Chinese Academy of Sciences, 100190, Beijing, China
- Wang Xiang: Anhui Province Key Laboratory of Multimodal Cognitive Computation, School of Computer Science and Technology, Anhui University, Hefei 230601, China
- Chengshi Zheng: Key Laboratory of Noise and Vibration Research, Institute of Acoustics, Chinese Academy of Sciences, 100190, Beijing, China
- Zhao Lv: Anhui Province Key Laboratory of Multimodal Cognitive Computation, School of Computer Science and Technology, Anhui University, Hefei 230601, China
- Xiaopei Wu: Anhui Province Key Laboratory of Multimodal Cognitive Computation, School of Computer Science and Technology, Anhui University, Hefei 230601, China
18. Hou Z, Hu Q, Chen K, Cao Z, Lu J. Local spectral attention for full-band speech enhancement. JASA Express Lett 2023;3:115201. PMID: 37916951. DOI: 10.1121/10.0022268.
Abstract
Attention mechanism has been widely used in speech enhancement (SE) because, theoretically, it can effectively model the inherent connection of signal in time domain and spectrum domain. In this Letter, it is found that the attention over the entire frequency range hampers the inference for full-band SE and possibly leads to excessive residual noise and degradation of speech. To alleviate this problem, the local spectral attention is introduced into full-band SE model by limiting the span of attention. The ablation tests on three full-band SE models reveal that the local frequency attention can effectively improve overall performance.
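Limiting the attention span over frequency can be implemented with a banded mask on the attention logits. The sketch below is a single-head version with an assumed span parameter, not the specific full-band SE models tested in the Letter.

```python
import torch

def local_spectral_attention(q, k, v, span):
    """Scaled dot-product attention over the frequency axis with a limited span.

    q, k, v: [batch, n_freq, d]. Each frequency bin may only attend to bins within
    `span` bins of itself, i.e. full-band attention restricted to a local neighbourhood.
    """
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d ** 0.5                    # [batch, F, F]
    idx = torch.arange(q.shape[-2], device=q.device)
    local = (idx.unsqueeze(0) - idx.unsqueeze(1)).abs() <= span    # [F, F] band mask
    scores = scores.masked_fill(~local, float("-inf"))             # block distant bins
    return torch.softmax(scores, dim=-1) @ v
```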
Affiliation(s)
- Zhongshu Hou: Key Laboratory of Modern Acoustics, Nanjing University, Nanjing 210093, China; Nanjing University-Horizon Intelligent Audio Lab, Horizon Robotics, Beijing 100094, China; Nanjing Institute of Advanced Artificial Intelligence, Nanjing 210014, China
- Qinwen Hu: Key Laboratory of Modern Acoustics, Nanjing University, Nanjing 210093, China; Nanjing University-Horizon Intelligent Audio Lab, Horizon Robotics, Beijing 100094, China; Nanjing Institute of Advanced Artificial Intelligence, Nanjing 210014, China
- Kai Chen: Key Laboratory of Modern Acoustics, Nanjing University, Nanjing 210093, China; Nanjing University-Horizon Intelligent Audio Lab, Horizon Robotics, Beijing 100094, China; Nanjing Institute of Advanced Artificial Intelligence, Nanjing 210014, China
- Zhanzhong Cao: Nanjing Institute of Information Technology, Nanjing 210036, China
- Jing Lu: Key Laboratory of Modern Acoustics, Nanjing University, Nanjing 210093, China; Nanjing University-Horizon Intelligent Audio Lab, Horizon Robotics, Beijing 100094, China; Nanjing Institute of Advanced Artificial Intelligence, Nanjing 210014, China
19. Li G, Fu M, Sun M, Liu X, Zheng B. A Facial Feature and Lip Movement Enhanced Audio-Visual Speech Separation Model. Sensors (Basel) 2023;23:8770. PMID: 37960477. PMCID: PMC10647675. DOI: 10.3390/s23218770.
Abstract
The cocktail party problem can be addressed more effectively by leveraging the speaker's visual and audio information. This paper proposes a method to improve audio separation using two visual cues: facial features and lip movement. Firstly, residual connections are introduced in the audio separation module to extract detailed features. Secondly, because the video stream contains information other than the face, which has minimal correlation with the audio, an attention mechanism is employed in the face module to focus on crucial information. Then, the loss function incorporates audio-visual similarity to fully exploit the relationship between the audio and visual streams. Experimental results on the public VoxCeleb2 dataset show that the proposed model significantly improved SDR, PESQ, and STOI, with gains of up to 4 dB in SDR.
Affiliation(s)
- Guizhu Li: College of Electronic Engineering, Ocean University of China, Qingdao 266100, China
- Min Fu: College of Electronic Engineering, Ocean University of China, Qingdao 266100, China; Sanya Oceanography Institution, Ocean University of China, Sanya 572024, China
- Mengnan Sun: College of Electronic Engineering, Ocean University of China, Qingdao 266100, China
- Xuefeng Liu: College of Automation and Electronic Engineering, Qingdao University of Science and Technology, Qingdao 266061, China
- Bing Zheng: College of Electronic Engineering, Ocean University of China, Qingdao 266100, China; Sanya Oceanography Institution, Ocean University of China, Sanya 572024, China
20. Triantafyllopoulos A, Kathan A, Baird A, Christ L, Gebhard A, Gerczuk M, Karas V, Hübner T, Jing X, Liu S, Mallol-Ragolta A, Milling M, Ottl S, Semertzidou A, Rajamani ST, Yan T, Yang Z, Dineley J, Amiriparian S, Bartl-Pokorny KD, Batliner A, Pokorny FB, Schuller BW. HEAR4Health: a blueprint for making computer audition a staple of modern healthcare. Front Digit Health 2023;5:1196079. PMID: 37767523. PMCID: PMC10520966. DOI: 10.3389/fdgth.2023.1196079.
Abstract
Recent years have seen a rapid increase in digital medicine research in an attempt to transform traditional healthcare systems to their modern, intelligent, and versatile equivalents that are adequately equipped to tackle contemporary challenges. This has led to a wave of applications that utilise AI technologies; first and foremost in the fields of medical imaging, but also in the use of wearables and other intelligent sensors. In comparison, computer audition can be seen to be lagging behind, at least in terms of commercial interest. Yet, audition has long been a staple assistant for medical practitioners, with the stethoscope being the quintessential sign of doctors around the world. Transforming this traditional technology with the use of AI entails a set of unique challenges. We categorise the advances needed in four key pillars: Hear, corresponding to the cornerstone technologies needed to analyse auditory signals in real-life conditions; Earlier, for the advances needed in computational and data efficiency; Attentively, for accounting to individual differences and handling the longitudinal nature of medical data; and, finally, Responsibly, for ensuring compliance to the ethical standards accorded to the field of medicine. Thus, we provide an overview and perspective of HEAR4Health: the sketch of a modern, ubiquitous sensing system that can bring computer audition on par with other AI technologies in the strive for improved healthcare systems.
Affiliation(s)
- Andreas Triantafyllopoulos, Alexander Kathan, Alice Baird, Lukas Christ, Alexander Gebhard, Maurice Gerczuk, Vincent Karas, Tobias Hübner, Xin Jing, Shuo Liu, Manuel Milling, Sandra Ottl, Anastasia Semertzidou, Tianhao Yan, Zijiang Yang, Judith Dineley, Shahin Amiriparian, Anton Batliner: EIHW – Chair of Embedded Intelligence for Healthcare and Wellbeing, University of Augsburg, Augsburg, Germany
- Adria Mallol-Ragolta: EIHW – Chair of Embedded Intelligence for Healthcare and Wellbeing, University of Augsburg, Augsburg, Germany; Centre for Interdisciplinary Health Research, University of Augsburg, Augsburg, Germany
- Katrin D. Bartl-Pokorny: EIHW – Chair of Embedded Intelligence for Healthcare and Wellbeing, University of Augsburg, Augsburg, Germany; Division of Phoniatrics, Medical University of Graz, Graz, Austria
- Florian B. Pokorny: EIHW – Chair of Embedded Intelligence for Healthcare and Wellbeing, University of Augsburg, Augsburg, Germany; Division of Phoniatrics, Medical University of Graz, Graz, Austria; Centre for Interdisciplinary Health Research, University of Augsburg, Augsburg, Germany
- Björn W. Schuller: EIHW – Chair of Embedded Intelligence for Healthcare and Wellbeing, University of Augsburg, Augsburg, Germany; Centre for Interdisciplinary Health Research, University of Augsburg, Augsburg, Germany; GLAM – Group on Language, Audio, & Music, Imperial College London, London, United Kingdom
21. Henry F, Parsi A, Glavin M, Jones E. Experimental Investigation of Acoustic Features to Optimize Intelligibility in Cochlear Implants. Sensors (Basel) 2023;23:7553. PMID: 37688009. PMCID: PMC10490615. DOI: 10.3390/s23177553.
Abstract
Although cochlear implants work well for people with hearing impairment in quiet conditions, it is well-known that they are not as effective in noisy environments. Noise reduction algorithms based on machine learning allied with appropriate speech features can be used to address this problem. The purpose of this study is to investigate the importance of acoustic features in such algorithms. Acoustic features are extracted from speech and noise mixtures and used in conjunction with the ideal binary mask to train a deep neural network to estimate masks for speech synthesis to produce enhanced speech. The intelligibility of this speech is objectively measured using metrics such as Short-time Objective Intelligibility (STOI), Hit Rate minus False Alarm Rate (HIT-FA) and Normalized Covariance Measure (NCM) for both simulated normal-hearing and hearing-impaired scenarios. A wide range of existing features is experimentally evaluated, including features that have not been traditionally applied in this application. The results demonstrate that frequency domain features perform best. In particular, Gammatone features performed best for normal hearing over a range of signal-to-noise ratios and noise types (STOI = 0.7826). Mel spectrogram features exhibited the best overall performance for hearing impairment (NCM = 0.7314). There is a stronger correlation between STOI and NCM than HIT-FA and NCM, suggesting that the former is a better predictor of intelligibility for hearing-impaired listeners. The results of this study may be useful in the design of adaptive intelligibility enhancement systems for cochlear implants based on both the noise level and the nature of the noise (stationary or non-stationary).
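Frequency-domain features of the kind compared here are easy to extract. The sketch below computes log-Mel features with librosa (parameters assumed); the Gammatone features reported in the study would require a dedicated gammatone filter-bank implementation instead.

```python
import librosa

def log_mel_features(y, sr=16000, n_fft=512, hop=256, n_mels=64):
    """Log-Mel spectrogram features, one of the frequency-domain feature sets
    commonly used as DNN input for mask estimation."""
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft, hop_length=hop, n_mels=n_mels)
    return librosa.power_to_db(mel)          # shape [n_mels, n_frames], fed frame-by-frame to the DNN
```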
Affiliation(s)
- Fergal Henry: Department of Computing and Electronic Engineering, Atlantic Technological University Sligo, Ash Lane, F91 YW50 Sligo, Ireland
- Ashkan Parsi: Electrical and Electronic Engineering, University of Galway, University Road, H91 TK33 Galway, Ireland
- Martin Glavin: Electrical and Electronic Engineering, University of Galway, University Road, H91 TK33 Galway, Ireland
- Edward Jones: Electrical and Electronic Engineering, University of Galway, University Road, H91 TK33 Galway, Ireland
22. Yang Y, Pandey A, Wang D. Time-Domain Speech Enhancement for Robust Automatic Speech Recognition. Interspeech 2023;2023:4913-4917. PMID: 40313476. PMCID: PMC12045131. DOI: 10.21437/interspeech.2023-167.
Abstract
It has been shown that the intelligibility of noisy speech can be improved by speech enhancement algorithms. However, speech enhancement has not been established as an effective frontend for robust automatic speech recognition (ASR) in noisy conditions compared to an ASR model trained on noisy speech directly. The divide between speech enhancement and ASR impedes the progress of robust ASR systems especially as speech enhancement has made big strides in recent years. In this work, we focus on eliminating this divide with an ARN (attentive recurrent network) based time-domain enhancement model. The proposed system fully decouples speech enhancement and an acoustic model trained only on clean speech. Results on the CHiME-2 corpus show that ARN enhanced speech translates to improved ASR results. The proposed system achieves 6.28% average word error rate, outperforming the previous best by 19.3% relatively.
Collapse
Affiliation(s)
- Yufeng Yang
- Department of Computer Science and Engineering, The Ohio State University, USA
| | - Ashutosh Pandey
- Department of Computer Science and Engineering, The Ohio State University, USA
| | - DeLiang Wang
- Department of Computer Science and Engineering, The Ohio State University, USA
- Center for Cognitive and Brain Sciences, The Ohio State University, USA
| |
Collapse
|
23
|
Peracha FK, Khattak MI, Salem N, Saleem N. Causal speech enhancement using dynamical-weighted loss and attention encoder-decoder recurrent neural network. PLoS One 2023; 18:e0285629. [PMID: 37167227 PMCID: PMC10174555 DOI: 10.1371/journal.pone.0285629] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/28/2022] [Accepted: 04/26/2023] [Indexed: 05/13/2023] Open
Abstract
Speech enhancement (SE) reduces background noise in target speech and is applied at the front end in various real-world applications, including robust ASR and real-time processing in mobile phone communications. SE systems are commonly integrated into mobile phones to increase quality and intelligibility; as a result, a low-latency system is required to operate in real-world applications. At the same time, these systems need efficient optimization. This research focuses on single-microphone SE operating in real-time systems with better optimization. We propose a causal data-driven model that uses an attention encoder-decoder long short-term memory (LSTM) network to estimate the time-frequency mask from noisy speech in order to produce clean speech for real-time applications that require low-latency causal processing. The encoder-decoder LSTM and a causal attention mechanism are used in the proposed model. Furthermore, a dynamical-weighted (DW) loss function is proposed to improve model learning by varying the weight loss values. Experiments demonstrated that the proposed model consistently improves voice quality, intelligibility, and noise suppression. In the causal processing mode, the LSTM-based estimated suppression time-frequency mask outperforms the baseline model for unseen noise types. The proposed SE improved STOI by 2.64% (baseline LSTM-IRM), 6.6% (LSTM-KF), 4.18% (DeepXi-KF), and 3.58% (DeepResGRU-KF). In addition, we examined word error rates (WERs) using Google's Automatic Speech Recognition (ASR). The ASR results show that error rates decreased from 46.33% (noisy signals) to 13.11% (proposed), 15.73% (LSTM), and 14.97% (LSTM-KF).
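A minimal sketch of the ingredients named in this abstract, under stated assumptions: an encoder-decoder LSTM that predicts a time-frequency mask, trained with a weighted mean-squared error whose simple epoch-dependent schedule stands in for the paper's dynamical-weighted (DW) loss; the causal attention mechanism and the exact weighting scheme are omitted.

```python
# Illustrative sketch, not the paper's exact recipe: an encoder-decoder LSTM
# predicting a time-frequency mask, trained with a weighted MSE whose simple
# per-epoch schedule stands in for the dynamical-weighted (DW) loss.
import torch
import torch.nn as nn

class MaskEstimator(nn.Module):
    def __init__(self, n_freq=257, hidden=256):
        super().__init__()
        self.encoder = nn.LSTM(n_freq, hidden, batch_first=True)
        self.decoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, n_freq)

    def forward(self, noisy_mag):              # (batch, time, freq)
        h, _ = self.encoder(noisy_mag)
        h, _ = self.decoder(h)
        return torch.sigmoid(self.proj(h))     # mask in [0, 1]

def weighted_mask_loss(pred, target, epoch, max_epoch=100):
    # Hypothetical schedule: gradually emphasize errors on speech-dominant units.
    w = 1.0 + (epoch / max_epoch) * target
    return torch.mean(w * (pred - target) ** 2)

model = MaskEstimator()
noisy = torch.rand(4, 100, 257)
ideal = (torch.rand(4, 100, 257) > 0.5).float()
loss = weighted_mask_loss(model(noisy), ideal, epoch=10)
loss.backward()
print(float(loss))
```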
Collapse
Affiliation(s)
- Fahad Khalil Peracha
- Department of Electrical Engineering, University of Engineering and Technology, Peshawar, KPK, Pakistan
| | - Muhammad Irfan Khattak
- Department of Electrical Engineering, University of Engineering and Technology, Peshawar, KPK, Pakistan
| | - Nema Salem
- Electrical and Computer Engineering Department, Effat College of Engineering, Effat University, Jeddah, KSA
| | - Nasir Saleem
- Department of Electrical Engineering, University of Engineering and Technology, Peshawar, KPK, Pakistan
| |
Collapse
|
24
|
Bellur A, Thakkar K, Elhilali M. Explicit-memory multiresolution adaptive framework for speech and music separation. EURASIP JOURNAL ON AUDIO, SPEECH, AND MUSIC PROCESSING 2023; 2023:20. [PMID: 37181589 PMCID: PMC10169896 DOI: 10.1186/s13636-023-00286-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 12/23/2022] [Accepted: 04/21/2023] [Indexed: 05/16/2023]
Abstract
The human auditory system employs a number of principles to facilitate the selection of perceptually separated streams from a complex sound mixture. The brain leverages multi-scale redundant representations of the input and uses memory (or priors) to guide the selection of a target sound from the input mixture. Moreover, feedback mechanisms refine the memory constructs resulting in further improvement of selectivity of a particular sound object amidst dynamic backgrounds. The present study proposes a unified end-to-end computational framework that mimics these principles for sound source separation applied to both speech and music mixtures. While the problems of speech enhancement and music separation have often been tackled separately due to constraints and specificities of each signal domain, the current work posits that common principles for sound source separation are domain-agnostic. In the proposed scheme, parallel and hierarchical convolutional paths map input mixtures onto redundant but distributed higher-dimensional subspaces and utilize the concept of temporal coherence to gate the selection of embeddings belonging to a target stream abstracted in memory. These explicit memories are further refined through self-feedback from incoming observations in order to improve the system's selectivity when faced with unknown backgrounds. The model yields stable outcomes of source separation for both speech and music mixtures and demonstrates benefits of explicit memory as a powerful representation of priors that guide information selection from complex inputs.
Collapse
Affiliation(s)
- Ashwin Bellur
- Electrical and Computer Engineering, Johns Hopkins University, Baltimore, USA
| | - Karan Thakkar
- Electrical and Computer Engineering, Johns Hopkins University, Baltimore, USA
| | - Mounya Elhilali
- Electrical and Computer Engineering, Johns Hopkins University, Baltimore, USA
| |
Collapse
|
25
|
Healy EW, Johnson EM, Pandey A, Wang D. Progress made in the efficacy and viability of deep-learning-based noise reduction. THE JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA 2023; 153:2751. [PMID: 37133814 PMCID: PMC10159658 DOI: 10.1121/10.0019341] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 08/06/2022] [Revised: 04/17/2023] [Accepted: 04/17/2023] [Indexed: 05/04/2023]
Abstract
Recent years have brought considerable advances to our ability to increase intelligibility through deep-learning-based noise reduction, especially for hearing-impaired (HI) listeners. In this study, intelligibility improvements resulting from a current algorithm are assessed. These benefits are compared to those resulting from the initial demonstration of deep-learning-based noise reduction for HI listeners ten years ago in Healy, Yoho, Wang, and Wang [(2013). J. Acoust. Soc. Am. 134, 3029-3038]. The stimuli and procedures were broadly similar across studies. However, whereas the initial study involved highly matched training and test conditions, as well as non-causal operation, which prevented real-world operation, the current attentive recurrent network employed different noise types, talkers, and speech corpora for training versus test, as required for generalization, and it was fully causal, as required for real-time operation. Significant intelligibility benefit was observed in every condition, averaging 51 percentage points across conditions for HI listeners. Further, the benefit was comparable to that obtained in the initial demonstration, despite the considerable additional demands placed on the current algorithm. The retention of large benefit despite the systematic removal of various constraints, as required for real-world operation, reflects the substantial advances made to deep-learning-based noise reduction.
Collapse
Affiliation(s)
- Eric W Healy
- Department of Speech and Hearing Science, and Center for Cognitive and Brain Sciences, The Ohio State University, Columbus, Ohio 43210, USA
| | - Eric M Johnson
- Department of Speech and Hearing Science, and Center for Cognitive and Brain Sciences, The Ohio State University, Columbus, Ohio 43210, USA
| | - Ashutosh Pandey
- Department of Computer Science and Engineering, and Center for Cognitive and Brain Sciences, The Ohio State University, Columbus, Ohio 43210, USA
| | - DeLiang Wang
- Department of Computer Science and Engineering, and Center for Cognitive and Brain Sciences, The Ohio State University, Columbus, Ohio 43210, USA
| |
Collapse
|
26
|
Rascon C. Characterization of Deep Learning-Based Speech-Enhancement Techniques in Online Audio Processing Applications. SENSORS (BASEL, SWITZERLAND) 2023; 23:s23094394. [PMID: 37177598 PMCID: PMC10181690 DOI: 10.3390/s23094394] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 03/07/2023] [Revised: 04/24/2023] [Accepted: 04/28/2023] [Indexed: 05/15/2023]
Abstract
Deep learning-based speech-enhancement techniques have recently been an area of growing interest, since their impressive performance can potentially benefit a wide variety of digital voice communication systems. However, such performance has been evaluated mostly in offline audio-processing scenarios (i.e., feeding the model, in one go, a complete audio recording, which may extend several seconds). It is of significant interest to evaluate and characterize the current state of the art in applications that process audio online (i.e., feeding the model a sequence of segments of audio data and concatenating the results at the output end). Although evaluations and comparisons between speech-enhancement techniques have been carried out before, as far as the author knows, the work presented here is the first that evaluates the performance of such techniques in relation to their online applicability. This means that this work measures how the output signal-to-interference ratio (as a separation metric), the response time, and memory usage (as online metrics) are impacted by the input length (the size of audio segments), in addition to the amount of noise, the amount and number of interferences, and the amount of reverberation. Three popular models were evaluated, given their availability in public repositories and their online viability: MetricGAN+, Spectral Feature Mapping with Mimic Loss, and Demucs-Denoiser. The characterization was carried out using a systematic evaluation protocol based on the SpeechBrain framework. Several intuitions are presented and discussed, and some recommendations for future work are proposed.
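The online evaluation setting described above can be sketched as follows: audio is fed to an enhancement model in fixed-size segments, the outputs are concatenated, and the per-segment response time is measured while the segment length is swept. The identity "model" below is a placeholder assumption standing in for MetricGAN+, Spectral Feature Mapping with Mimic Loss, or Demucs-Denoiser.

```python
# Minimal sketch of online (segment-wise) enhancement with latency measurement.
# enhance_segment is a placeholder for a real speech-enhancement model.
import time
import numpy as np

def enhance_segment(segment):
    return segment  # placeholder: a real model would denoise the segment here

def online_enhance(audio, segment_len):
    out, latencies = [], []
    for start in range(0, len(audio), segment_len):
        seg = audio[start:start + segment_len]
        t0 = time.perf_counter()
        out.append(enhance_segment(seg))
        latencies.append(time.perf_counter() - t0)
    return np.concatenate(out), np.mean(latencies)

audio = np.random.randn(16000 * 5)          # 5 s of audio at 16 kHz
for seg_len in (512, 4096, 16000):          # input length is the swept variable
    _, mean_latency = online_enhance(audio, seg_len)
    print(f"segment {seg_len:6d} samples -> mean latency {mean_latency * 1e6:.1f} us")
```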
Collapse
Affiliation(s)
- Caleb Rascon
- Computer Science Department, Instituto de Investigaciones en Matematicas Aplicadas y en Sistemas, Universidad Nacional Autonoma de Mexico, Mexico City 3000, Mexico
| |
Collapse
|
27
|
Chen H, Zhang X. CGA-MGAN: Metric GAN Based on Convolution-Augmented Gated Attention for Speech Enhancement. ENTROPY (BASEL, SWITZERLAND) 2023; 25:e25040628. [PMID: 37190416 PMCID: PMC10137386 DOI: 10.3390/e25040628] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 01/09/2023] [Revised: 03/15/2023] [Accepted: 04/04/2023] [Indexed: 05/17/2023]
Abstract
In recent years, neural networks based on attention mechanisms have seen increasing use in speech recognition, separation, and enhancement, as well as in other fields. In particular, the convolution-augmented transformer has performed well, as it can combine the advantages of convolution and self-attention. Recently, the gated attention unit (GAU) was proposed. Compared with traditional multi-head self-attention, approaches based on the GAU are effective and computationally efficient. In this paper, we propose a network for speech enhancement called CGA-MGAN, a MetricGAN based on convolution-augmented gated attention. CGA-MGAN captures local and global correlations in speech signals simultaneously by fusing convolution and gated attention units. Experiments on Voice Bank + DEMAND show that our proposed CGA-MGAN model achieves excellent performance (3.47 PESQ, 0.96 STOI, and 11.09 dB SSNR) with a relatively small model size (1.14 M).
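For readers unfamiliar with the gated attention unit, the following simplified sketch shows the general idea (a gating branch multiplied with single-head attended values); the dimensions and the squared-ReLU attention are assumptions for illustration and do not reproduce the CGA-MGAN configuration.

```python
# Simplified gated attention unit (GAU) sketch; dimensions and the single-head
# squared-ReLU attention are illustrative assumptions, not CGA-MGAN's setup.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedAttentionUnit(nn.Module):
    def __init__(self, dim=128, expansion=2, qk_dim=64):
        super().__init__()
        e = dim * expansion
        self.to_u = nn.Linear(dim, e)      # gating branch
        self.to_v = nn.Linear(dim, e)      # value branch
        self.to_qk = nn.Linear(dim, qk_dim)
        self.out = nn.Linear(e, dim)

    def forward(self, x):                  # x: (batch, time, dim)
        u = F.silu(self.to_u(x))
        v = F.silu(self.to_v(x))
        qk = self.to_qk(x)
        attn = F.relu(qk @ qk.transpose(-1, -2)) ** 2 / x.shape[1]
        return self.out(u * (attn @ v))    # gate applied to attended values

x = torch.randn(2, 50, 128)
print(GatedAttentionUnit()(x).shape)       # torch.Size([2, 50, 128])
```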
Collapse
Affiliation(s)
- Haozhe Chen
- Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190, China
- Key Laboratory of Electromagnetic Radiation and Sensing Technology, Chinese Academy of Sciences, Beijing 100190, China
- School of Electronic, Electrical, and Communication Engineering, University of Chinese Academy of Sciences, Beijing 100049, China
| | - Xiaojuan Zhang
- Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190, China
- Key Laboratory of Electromagnetic Radiation and Sensing Technology, Chinese Academy of Sciences, Beijing 100190, China
| |
Collapse
|
28
|
Pandey A, Wang D. Attentive Training: A New Training Framework for Speech Enhancement. IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING 2023; 31:1360-1370. [PMID: 37899765 PMCID: PMC10602021 DOI: 10.1109/taslp.2023.3260711] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 10/31/2023]
Abstract
Dealing with speech interference in a speech enhancement system requires either speaker separation or target speaker extraction. Speaker separation produces multiple output streams with arbitrary assignment, while target speaker extraction requires additional cueing for speaker selection. Neither is suitable for a standalone speech enhancement system with one output stream. In this study, we propose a novel training framework, called attentive training, to extend speech enhancement to deal with speech interruptions. Attentive training is based on the observation that, in the real world, multiple talkers are very unlikely to start speaking at the same time; therefore, a deep neural network can be trained to create a representation of the first speaker and utilize it to attend to, or track, that speaker in a multitalker noisy mixture. We present experimental results and comparisons to demonstrate the effectiveness of attentive training for speech enhancement.
Collapse
Affiliation(s)
- Ashutosh Pandey
- Department of Computer Science and Engineering, The Ohio State University, Columbus, OH 43210 USA
| | - DeLiang Wang
- Department of Computer Science and Engineering and the Center for Cognitive and Brain Sciences, The Ohio State University, Columbus, OH 43210 USA
| |
Collapse
|
29
|
Li F, Hu Y, Wang L. Unsupervised Single-Channel Singing Voice Separation with Weighted Robust Principal Component Analysis Based on Gammatone Auditory Filterbank and Vocal Activity Detection. SENSORS (BASEL, SWITZERLAND) 2023; 23:3015. [PMID: 36991724 PMCID: PMC10056690 DOI: 10.3390/s23063015] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 01/04/2023] [Revised: 02/27/2023] [Accepted: 03/09/2023] [Indexed: 06/19/2023]
Abstract
Singing-voice separation is the task of separating a singing voice from its musical accompaniment. In this paper, we propose a novel, unsupervised methodology for extracting a singing voice from the background in a musical mixture. The method is a modification of robust principal component analysis (RPCA) that separates the singing voice by using weighting based on a gammatone filterbank and vocal activity detection. Although RPCA is a helpful method for separating voices from the music mixture, it fails when one singular value, such as that associated with drums, is much larger than the others (e.g., those of the accompanying instruments). As a result, the proposed approach takes advantage of the differing values between the low-rank (background) and sparse (singing voice) matrices. Additionally, we propose an extended RPCA on the cochleagram by utilizing coalescent masking on the gammatone filterbank. Finally, we utilize vocal activity detection to enhance the separation outcomes by eliminating the residual music signal. Evaluation results reveal that the proposed approach provides better separation outcomes than RPCA on the ccMixter and DSD100 datasets.
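The decomposition underlying this method can be sketched with a bare-bones robust PCA (principal component pursuit) solved by inexact ALM: a magnitude spectrogram M is split into a low-rank part L (accompaniment) and a sparse part S (singing voice). The gammatone weighting, cochleagram processing, and vocal activity detection of the paper are not included, so this is only a minimal illustration.

```python
# Bare-bones robust PCA (principal component pursuit) via inexact ALM:
# M is decomposed into a low-rank matrix L and a sparse matrix S.
import numpy as np

def shrink(X, tau):                      # soft-thresholding operator
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def svd_shrink(X, tau):                  # singular-value thresholding
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(shrink(s, tau)) @ Vt

def rpca(M, n_iter=200):
    lam = 1.0 / np.sqrt(max(M.shape))
    mu = 0.25 * M.size / (np.abs(M).sum() + 1e-12)
    L, S, Y = np.zeros_like(M), np.zeros_like(M), np.zeros_like(M)
    for _ in range(n_iter):
        L = svd_shrink(M - S + Y / mu, 1.0 / mu)
        S = shrink(M - L + Y / mu, lam / mu)
        Y = Y + mu * (M - L - S)
    return L, S

M = np.abs(np.random.randn(64, 200))     # stand-in magnitude spectrogram
L, S = rpca(M)
print(np.linalg.matrix_rank(L, tol=1e-3), float(np.mean(S == 0)))
```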
Collapse
Affiliation(s)
- Feng Li
- Department of Computer Science and Technology, Anhui University of Finance and Economics, Bengbu 233030, China
- School of Information Science and Technology, University of Science and Technology of China, Hefei 230026, China
| | - Yujun Hu
- Department of Computer Science and Technology, Anhui University of Finance and Economics, Bengbu 233030, China
| | - Lingling Wang
- Department of Computer Science and Technology, Anhui University of Finance and Economics, Bengbu 233030, China
| |
Collapse
|
30
|
Hu Q, Hou Z, Chen K, Lu J. Learnable spectral dimension compression mapping for full-band speech enhancement. JASA EXPRESS LETTERS 2023; 3:025204. [PMID: 36858985 DOI: 10.1121/10.0017327] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/18/2023]
Abstract
The highly imbalanced power spectral density of full-band speech signals poses a significant challenge to full-band speech enhancement, and the commonly used spectral features that mimic the behavior of the human auditory system are not an optimal choice for full-band speech enhancement. In this paper, a learnable spectral dimension compression mapping is proposed to effectively compress the spectral feature along frequency, preserving high resolution in low frequencies while compressing information in high frequencies in a more flexible manner. Experimental results verify that the proposed method can be easily combined with different full-band speech enhancement models and achieve better performance.
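A hedged sketch of the idea of a learnable spectral-dimension compression: a trainable linear map along the frequency axis that keeps more output bins below a split frequency than above it. The bin counts and the band split below are assumptions for illustration, not the paper's configuration.

```python
# Illustrative learnable frequency-axis compression: high resolution is kept
# below a split bin, stronger compression is applied above it. Bin counts and
# the split point are example values only.
import torch
import torch.nn as nn

class LearnableFreqCompression(nn.Module):
    def __init__(self, n_in=769, n_low=256, n_out_low=192, n_out_high=64):
        super().__init__()
        self.n_low = n_low
        self.low = nn.Linear(n_low, n_out_low, bias=False)       # low-band map
        self.high = nn.Linear(n_in - n_low, n_out_high, bias=False)  # high-band map

    def forward(self, spec):                    # spec: (batch, time, n_in)
        lo = self.low(spec[..., :self.n_low])
        hi = self.high(spec[..., self.n_low:])
        return torch.cat([lo, hi], dim=-1)      # (batch, time, n_out_low + n_out_high)

spec = torch.rand(2, 100, 769)                  # e.g. full-band STFT magnitude bins
print(LearnableFreqCompression()(spec).shape)   # torch.Size([2, 100, 256])
```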
Collapse
Affiliation(s)
- Qinwen Hu
- Key Laboratory of Modern Acoustics, Nanjing University, Nanjing 210093, China
| | - Zhongshu Hou
- Key Laboratory of Modern Acoustics, Nanjing University, Nanjing 210093, China
| | - Kai Chen
- Key Laboratory of Modern Acoustics, Nanjing University, Nanjing 210093, China
| | - Jing Lu
- Key Laboratory of Modern Acoustics, Nanjing University, Nanjing 210093, China
| |
Collapse
|
31
|
Drgas S. A Survey on Low-Latency DNN-Based Speech Enhancement. SENSORS (BASEL, SWITZERLAND) 2023; 23:1380. [PMID: 36772421 PMCID: PMC9921748 DOI: 10.3390/s23031380] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 12/10/2022] [Revised: 01/19/2023] [Accepted: 01/23/2023] [Indexed: 06/18/2023]
Abstract
This paper presents recent advances in low-latency, single-channel, deep neural network-based speech enhancement systems. The sources of latency and their acceptable values in different applications are described. This is followed by an analysis of the constraints imposed on neural network architectures. Specifically, the causal units used in deep neural networks are presented and discussed in the context of their properties, such as the number of parameters, the receptive field, and computational complexity. This is followed by a discussion of techniques used to reduce the computational complexity and memory requirements of the neural networks used in this task. Finally, the techniques used by the winners of the latest speech enhancement challenges (DNS, Clarity) are shown and compared.
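A small helper reflecting the survey's discussion of causal units: the receptive field (in frames) of a stack of dilated 1-D causal convolutions, which bounds how much past context a low-latency model can use without adding algorithmic latency. The kernel sizes and dilations below are example values only.

```python
# Receptive field (in frames) of a stack of dilated 1-D causal convolutions.
def causal_receptive_field(kernel_sizes, dilations):
    rf = 1
    for k, d in zip(kernel_sizes, dilations):
        rf += (k - 1) * d          # each causal conv adds (k-1)*d frames of history
    return rf

layers = [(3, 2 ** i) for i in range(8)]        # TCN-style exponential dilation
kernels, dilations = zip(*layers)
frames = causal_receptive_field(kernels, dilations)
print(f"receptive field: {frames} frames "
      f"(~{frames * 10} ms of past context at a 10 ms hop)")
```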
Collapse
Affiliation(s)
- Szymon Drgas
- Institute of Automatic Control and Robotics, Poznan University of Technology, Piotrowo 3A Street, 60-965 Poznan, Poland
| |
Collapse
|
32
|
Zheng C, Zhang H, Liu W, Luo X, Li A, Li X, Moore BCJ. Sixty Years of Frequency-Domain Monaural Speech Enhancement: From Traditional to Deep Learning Methods. Trends Hear 2023; 27:23312165231209913. [PMID: 37956661 PMCID: PMC10658184 DOI: 10.1177/23312165231209913] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/02/2022] [Accepted: 10/09/2023] [Indexed: 11/15/2023] Open
Abstract
Frequency-domain monaural speech enhancement has been extensively studied for over 60 years, and a great number of methods have been proposed and applied to many devices. In the last decade, monaural speech enhancement has made tremendous progress with the advent and development of deep learning, and performance using such methods has been greatly improved relative to traditional methods. This survey paper first provides a comprehensive overview of traditional and deep-learning methods for monaural speech enhancement in the frequency domain. The fundamental assumptions of each approach are then summarized and analyzed to clarify their limitations and advantages. A comprehensive evaluation of some typical methods was conducted using the WSJ + Deep Noise Suppression (DNS) challenge and Voice Bank + DEMAND datasets to give an intuitive and unified comparison. The benefits of monaural speech enhancement methods using objective metrics relevant for normal-hearing and hearing-impaired listeners were evaluated. The objective test results showed that compression of the input features was important for simulated normal-hearing listeners but not for simulated hearing-impaired listeners. Potential future research and development topics in monaural speech enhancement are suggested.
Collapse
Affiliation(s)
- Chengshi Zheng
- Key Laboratory of Noise and Vibration Research, Institute of Acoustics, Chinese Academy of Sciences, Beijing, China
- University of Chinese Academy of Sciences, Beijing, China
| | - Huiyong Zhang
- Key Laboratory of Noise and Vibration Research, Institute of Acoustics, Chinese Academy of Sciences, Beijing, China
- University of Chinese Academy of Sciences, Beijing, China
| | - Wenzhe Liu
- Key Laboratory of Noise and Vibration Research, Institute of Acoustics, Chinese Academy of Sciences, Beijing, China
- University of Chinese Academy of Sciences, Beijing, China
| | - Xiaoxue Luo
- Key Laboratory of Noise and Vibration Research, Institute of Acoustics, Chinese Academy of Sciences, Beijing, China
- University of Chinese Academy of Sciences, Beijing, China
| | - Andong Li
- Key Laboratory of Noise and Vibration Research, Institute of Acoustics, Chinese Academy of Sciences, Beijing, China
- University of Chinese Academy of Sciences, Beijing, China
| | - Xiaodong Li
- Key Laboratory of Noise and Vibration Research, Institute of Acoustics, Chinese Academy of Sciences, Beijing, China
- University of Chinese Academy of Sciences, Beijing, China
| | - Brian C. J. Moore
- Cambridge Hearing Group, Department of Psychology, University of Cambridge, Cambridge, UK
| |
Collapse
|
33
|
Liu TH, Chi JZ, Wu BL, Chen YS, Huang CH, Chu YS. Design and Implementation of Machine Tool Life Inspection System Based on Sound Sensing. SENSORS (BASEL, SWITZERLAND) 2022; 23:284. [PMID: 36616882 PMCID: PMC9823646 DOI: 10.3390/s23010284] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 11/20/2022] [Revised: 12/15/2022] [Accepted: 12/23/2022] [Indexed: 06/17/2023]
Abstract
The main causes of damage to industrial machinery are aging, corrosion, and the wear of parts, which affect the accuracy of the machinery and product precision. Identifying problems early and predicting a machine's life cycle for early maintenance can avoid costly plant failures. Compared with other sensing and monitoring instruments, sound sensors are inexpensive, portable, and require less data and computation. This paper proposes a machine tool life cycle model with noise reduction. The life cycle model uses Mel-Frequency Cepstral Coefficients (MFCC) to extract audio features. A Deep Neural Network (DNN) is used to learn the relationship between the audio features and the life cycle, and then determine the audio signal corresponding to the degree of aging. The noise reduction model simulates the actual environment by adding noise, extracts features using Power-Normalized Cepstral Coefficients (PNCC), and uses a mask as the DNN's learning target to eliminate the effect of noise. The noise reduction model improves Short-Time Objective Intelligibility (STOI) by 6.8% and Perceptual Evaluation of Speech Quality (PESQ) by 3.9%. The life cycle model accuracy before denoising is 76%; after adding the noise reduction system, the accuracy of the life cycle model increases to 80%.
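A hedged sketch of the feature pipeline described above: MFCCs extracted from a machine-sound recording feed a small fully connected network that predicts an aging/life-cycle class. The network size, the number of classes, and the use of librosa are assumptions for illustration.

```python
# Sketch of an MFCC + DNN life-cycle classifier (sizes and class count assumed).
import numpy as np
import librosa
import torch
import torch.nn as nn

sr = 16000
audio = np.random.randn(sr * 2).astype(np.float32)       # stand-in for a tool recording
mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13)   # (13, frames)
features = torch.tensor(mfcc.mean(axis=1)).unsqueeze(0)  # clip-level feature vector

classifier = nn.Sequential(                               # DNN mapping MFCCs to wear level
    nn.Linear(13, 64), nn.ReLU(),
    nn.Linear(64, 4),                                      # e.g. 4 hypothetical aging stages
)
print(classifier(features).softmax(dim=-1))
```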
Collapse
Affiliation(s)
- Tsung-Hsien Liu
- Communications Engineering Department, National Chung Cheng University, Chiayi 62102, Taiwan
| | - Jun-Zhe Chi
- Electrical Engineering Department, National Chung Cheng University, Chiayi 62102, Taiwan
| | - Bo-Lin Wu
- Electrical Engineering Department, National Chung Cheng University, Chiayi 62102, Taiwan
| | - Yee-Shao Chen
- Electrical Engineering Department, National Chung Cheng University, Chiayi 62102, Taiwan
| | - Chung-Hsun Huang
- Electrical Engineering Department, National Chung Cheng University, Chiayi 62102, Taiwan
| | - Yuan-Sun Chu
- Electrical Engineering Department, National Chung Cheng University, Chiayi 62102, Taiwan
| |
Collapse
|
34
|
Liu S, Mallol-Ragolta A, Parada-Cabaleiro E, Qian K, Jing X, Kathan A, Hu B, Schuller BW. Audio self-supervised learning: A survey. PATTERNS (NEW YORK, N.Y.) 2022; 3:100616. [PMID: 36569546 PMCID: PMC9768631 DOI: 10.1016/j.patter.2022.100616] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Indexed: 12/13/2022]
Abstract
Similar to humans' cognitive ability to generalize knowledge and skills, self-supervised learning (SSL) targets discovering general representations from large-scale data. This, through the use of pre-trained SSL models for downstream tasks, alleviates the need for human annotation, which is an expensive and time-consuming task. Its success in the fields of computer vision and natural language processing has prompted its recent adoption in the field of audio and speech processing. Comprehensive reviews summarizing the knowledge in audio SSL are currently missing. To fill this gap, we provide an overview of the SSL methods used for audio and speech processing applications. Herein, we also summarize the empirical works that exploit the audio modality in multi-modal SSL frameworks and the existing suitable benchmarks to evaluate the power of SSL in the computer audition domain. Finally, we discuss some open problems and point out future directions in the development of audio SSL.
Collapse
Affiliation(s)
- Shuo Liu
- Chair of Embedded Intelligence for Health Care & Wellbeing, University of Augsburg, 86159 Augsburg, Germany; Corresponding author
| | - Adria Mallol-Ragolta
- Chair of Embedded Intelligence for Health Care & Wellbeing, University of Augsburg, 86159 Augsburg, Germany
| | | | - Kun Qian
- School of Medical Technology, Beijing Institute of Technology, Beijing 100081, China
| | - Xin Jing
- Chair of Embedded Intelligence for Health Care & Wellbeing, University of Augsburg, 86159 Augsburg, Germany
| | - Alexander Kathan
- Chair of Embedded Intelligence for Health Care & Wellbeing, University of Augsburg, 86159 Augsburg, Germany
| | - Bin Hu
- School of Medical Technology, Beijing Institute of Technology, Beijing 100081, China
| | - Björn W. Schuller
- Chair of Embedded Intelligence for Health Care & Wellbeing, University of Augsburg, 86159 Augsburg, Germany; GLAM – the Group on Language, Audio, & Music, Imperial College London, London SW7 2AZ, UK
| |
Collapse
|
35
|
Hepsiba D, Justin J. Enhancement of single channel speech quality and intelligibility in multiple noise conditions using wiener filter and deep CNN. Soft comput 2022. [DOI: 10.1007/s00500-021-06291-2] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/24/2022]
|
36
|
Multichannel Variational Autoencoder-Based Speech Separation in Designated Speaker Order. Symmetry (Basel) 2022. [DOI: 10.3390/sym14122514] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/29/2022] Open
Abstract
The multichannel variational autoencoder (MVAE) integrates the rule-based update of a separation matrix and the deep generative model and proves to be a competitive speech separation method. However, the output (global) permutation ambiguity still exists and turns out to be a fundamental problem in applications. In this paper, we address this problem by employing two dedicated encoders. One encodes the speaker identity for the guidance of the output sorting, and the other encodes the linguistic information for the reconstruction of the source signals. The instance normalization (IN) and the adaptive instance normalization (adaIN) are applied to the networks to disentangle the speaker representations from the content representations. The separated sources are arranged in designated order by a symmetric permutation alignment scheme. In the experiments, we test the proposed method in different gender combinations and various reverberant conditions and generalize it to unseen speakers. The results validate its reliable sorting accuracy and good separation performance. The proposed method outperforms the other baseline methods and maintains stable performance, achieving over 20 dB SIR improvement even in high reverberant environments.
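The instance-normalization step used for disentanglement above can be illustrated with a minimal instance-norm / adaptive-instance-norm pair; the shapes are illustrative and the MVAE encoders themselves are not shown.

```python
# Minimal IN / adaIN pair: adaIN re-imposes the style stream's channel
# statistics on the normalized content stream. Shapes are illustrative.
import torch

def instance_norm(x, eps=1e-5):                 # x: (batch, channels, time)
    mu = x.mean(dim=-1, keepdim=True)
    sigma = x.std(dim=-1, keepdim=True)
    return (x - mu) / (sigma + eps)

def ada_in(content, style, eps=1e-5):
    mu_s = style.mean(dim=-1, keepdim=True)
    sigma_s = style.std(dim=-1, keepdim=True)
    return sigma_s * instance_norm(content, eps) + mu_s

content = torch.randn(2, 64, 100)   # linguistic-content features
style = torch.randn(2, 64, 100)     # speaker-identity features
print(ada_in(content, style).shape) # torch.Size([2, 64, 100])
```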
Collapse
|
37
|
Multichannel KHMF for speech separation with enthalpy based DOA and score based CNN (SCNN). EVOLVING SYSTEMS 2022. [DOI: 10.1007/s12530-022-09473-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/17/2022]
|
38
|
Šarić Z, Subotić M, Bilibajkić R, Barjaktarović M, Stojanović J. Supervised speech separation combined with adaptive beamforming. COMPUT SPEECH LANG 2022. [DOI: 10.1016/j.csl.2022.101409] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/16/2022]
|
39
|
Shi J, Chang X, Watanabe S, Xu B. Train from scratch: Single-stage joint training of speech separation and recognition. COMPUT SPEECH LANG 2022. [DOI: 10.1016/j.csl.2022.101387] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/15/2022]
|
40
|
Zhong D, Hu Y, Zhao K, Deng W, Hou P, Zhang J. Accurate separation of mixed high-dimension optical-chaotic signals using optical reservoir computing based on optically pumped VCSELs. OPTICS EXPRESS 2022; 30:39561-39581. [PMID: 36298905 DOI: 10.1364/oe.470857] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Received: 07/20/2022] [Accepted: 09/28/2022] [Indexed: 06/16/2023]
Abstract
In this work, schemes and theories for separating two groups of mixed optical chaotic signals are proposed in detail, for cases where the mixing fractions are either known in advance or unknown, using VCSEL-based reservoir computing (RC) systems. Here, the two groups of mixed optical chaotic signals are linear combinations of several beams of the chaotic x-polarization components (X-PCs) and y-polarization components (Y-PCs) emitted by optically pumped spin-VCSELs operating alone. Two parallel reservoirs are implemented using the chaotic X-PC and Y-PC output by an optically pumped spin-VCSEL with both optical feedback and optical injection. Moreover, we further demonstrate the separation performance for mixed chaotic signals that are linear combinations of no more than three beams of the chaotic X-PC or Y-PC. We find that, when the mixing fractions are known in advance, the two groups of mixed optical chaotic signals can be effectively separated using two reservoirs in a single RC system based on an optically pumped spin-VCSEL, with separation errors (characterized by the training errors) of no more than 0.093. If the mixing fractions are unknown, we utilize two cascaded RC systems based on optically pumped spin-VCSELs to separate each group of the mixed optical signals. The mixing fractions can be accurately predicted using two parallel reservoirs in the first RC system. Based on the predicted mixing fractions, the two groups of mixed optical chaotic signals can then be effectively separated using two parallel reservoirs in the second RC system, again with separation errors of no more than 0.093. In the same way, mixed optical chaotic signals formed from more than three beams of optical chaotic signals can be effectively separated. The separation method proposed in this paper may contribute to the development of novel principles of multiple access and demultiplexing in multi-channel chaotic cryptographic communication.
Collapse
|
41
|
Chou KF, Boyd AD, Best V, Colburn HS, Sen K. A biologically oriented algorithm for spatial sound segregation. Front Neurosci 2022; 16:1004071. [PMID: 36312015 PMCID: PMC9614053 DOI: 10.3389/fnins.2022.1004071] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/26/2022] [Accepted: 09/28/2022] [Indexed: 11/13/2022] Open
Abstract
Listening in an acoustically cluttered scene remains a difficult task for both machines and hearing-impaired listeners. Normal-hearing listeners accomplish this task with relative ease by segregating the scene into its constituent sound sources, then selecting and attending to a target source. An assistive listening device that mimics the biological mechanisms underlying this behavior may provide an effective solution for those with difficulty listening in acoustically cluttered environments (e.g., a cocktail party). Here, we present a binaural sound segregation algorithm based on a hierarchical network model of the auditory system. In the algorithm, binaural sound inputs first drive populations of neurons tuned to specific spatial locations and frequencies. The spiking responses of neurons in the output layer are then reconstructed into audible waveforms via a novel reconstruction method. We evaluate the performance of the algorithm with a speech-on-speech intelligibility task in normal-hearing listeners. This two-microphone-input algorithm is shown to provide listeners with perceptual benefit similar to that of a 16-microphone acoustic beamformer. These results demonstrate the promise of this biologically inspired algorithm for enhancing selective listening in challenging multi-talker scenes.
Collapse
Affiliation(s)
- Kenny F. Chou
- Department of Biomedical Engineering, Boston University, Boston, MA, United States
| | - Alexander D. Boyd
- Department of Biomedical Engineering, Boston University, Boston, MA, United States
| | - Virginia Best
- Department of Speech, Language and Hearing Sciences, Boston University, Boston, MA, United States
| | - H. Steven Colburn
- Department of Biomedical Engineering, Boston University, Boston, MA, United States
| | - Kamal Sen
- Department of Biomedical Engineering, Boston University, Boston, MA, United States
- Correspondence: Kamal Sen
| |
Collapse
|
42
|
Wang H, Zhang X, Wang D. Fusing Bone-conduction and Air-conduction Sensors for Complex-Domain Speech Enhancement. IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING 2022; 30:3134-3143. [PMID: 37124143 PMCID: PMC10147322 DOI: 10.1109/taslp.2022.3209943] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/03/2023]
Abstract
Speech enhancement aims to improve the listening quality and intelligibility of noisy speech in adverse environments. It proves to be challenging to perform speech enhancement in very low signal-to-noise ratio (SNR) conditions. Conventional speech enhancement utilizes air-conduction (AC) microphones, which are sensitive to background noise but capable of capturing full-band signals. On the other hand, bone-conduction (BC) sensors are unaffected by acoustic noise, but recorded speech has limited bandwidth. This study proposes an attention-based fusion method to combine the strengths of AC and BC signals and perform complex spectral mapping for speech enhancement. Experiments on the EMSB dataset demonstrate that the proposed approach effectively leverages the advantages of AC and BC sensors, and outperforms a recent time-domain baseline in all conditions. We also show that the sensor fusion method is superior to single-sensor counterparts, especially in low SNR conditions. As the amount of BC data is very limited, we additionally propose a semi-supervised technique to utilize both parallel and non-parallel recordings of AC and BC speech signals. With additional AC speech from the AISHELL-1 dataset, we achieve performance similar to supervised learning with only 50% parallel data.
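An illustrative sketch of attention-based sensor fusion in the spirit of the approach above: per-frame weights decide how much to trust the air-conduction versus bone-conduction feature stream before spectral mapping. The feature sizes and the scoring network are assumptions, not the paper's architecture.

```python
# Per-frame attention weights over two sensor streams (AC and BC features).
import torch
import torch.nn as nn

class SensorFusion(nn.Module):
    def __init__(self, feat_dim=256):
        super().__init__()
        self.score = nn.Linear(2 * feat_dim, 2)   # one logit per sensor, per frame

    def forward(self, ac, bc):                    # (batch, time, feat_dim) each
        w = torch.softmax(self.score(torch.cat([ac, bc], dim=-1)), dim=-1)
        return w[..., 0:1] * ac + w[..., 1:2] * bc

ac = torch.randn(2, 100, 256)                     # noisy but full-band AC features
bc = torch.randn(2, 100, 256)                     # noise-robust, band-limited BC features
print(SensorFusion()(ac, bc).shape)               # torch.Size([2, 100, 256])
```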
Collapse
Affiliation(s)
- Heming Wang
- Department of Computer Science and Engineering, The Ohio State University, OH 43210 USA
| | - Xueliang Zhang
- Department of Computer Science, Inner Mongolia University, Hohhot 010021, China
| | - DeLiang Wang
- Department of Computer Science and Engineering and the Center for Cognitive and Brain Sciences, The Ohio State University, Columbus, OH 43210 USA
| |
Collapse
|
43
|
Zhang K, Liu T, Song S, Zhao X, Sun S, Metzner W, Feng J, Liu Y. Separating overlapping bat calls with a bi-directional long short-term memory network. Integr Zool 2022; 17:741-751. [PMID: 33881210 DOI: 10.1111/1749-4877.12549] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/26/2022]
Abstract
Acquiring clear acoustic signals is critical for the analysis of animal vocalizations. Bioacoustics studies commonly face the problem of overlapping signals, which can impede the structural identification of vocal units, but there is currently no satisfactory solution. This study presents a bi-directional long short-term memory network to separate overlapping echolocation-communication calls of 6 different bat species and reconstruct waveforms. The separation quality was evaluated using 7 temporal-spectrum parameters. All the echolocation pulses and syllables of communication calls in the overlapping signals were separated, and parameter comparisons showed no significant difference and negligible deviation between the extracted and original calls. Clustering analysis was conducted with separated echolocation calls from each bat species to provide an example of a practical application of the separated and reconstructed calls. The clustering analysis showed a high corrected Rand index (82.79%), suggesting that the reconstructed waveforms could be reliably used for species classification. These results demonstrate a convenient and automated approach for separating overlapping calls. The study extends the application of deep neural networks to separating overlapping animal sounds.
Collapse
Affiliation(s)
- Kangkang Zhang
- Jilin Provincial Key Laboratory of Animal Resource Conservation and Utilization, Northeast Normal University, Changchun, China
| | - Tong Liu
- Jilin Provincial Key Laboratory of Animal Resource Conservation and Utilization, Northeast Normal University, Changchun, China
| | - Shengjing Song
- Jilin Provincial Key Laboratory of Animal Resource Conservation and Utilization, Northeast Normal University, Changchun, China
| | - Xin Zhao
- Jilin Provincial Key Laboratory of Animal Resource Conservation and Utilization, Northeast Normal University, Changchun, China
| | - Shijun Sun
- School of Environment, Northeast Normal University, Changchun, China
| | - Walter Metzner
- Department of Integrative Biology and Physiology, University of California, Los Angeles, California, USA
| | - Jiang Feng
- Jilin Provincial Key Laboratory of Animal Resource Conservation and Utilization, Northeast Normal University, Changchun, China
- Collage of Animal Science and Technology, Jilin Agricultural University, Changchun, China
| | - Ying Liu
- Jilin Provincial Key Laboratory of Animal Resource Conservation and Utilization, Northeast Normal University, Changchun, China
- Key Laboratory for Vegetation Ecology, Ministry of Education, Northeast Normal University, Changchun, China
| |
Collapse
|
44
|
Toward Personalized Diagnosis and Therapy for Hearing Loss: Insights From Cochlear Implants. Otol Neurotol 2022; 43:e903-e909. [PMID: 35970169 DOI: 10.1097/mao.0000000000003624] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/25/2022]
Abstract
Sensorineural hearing loss (SNHL) is the most common sensory deficit, disabling nearly half a billion people worldwide. The cochlear implant (CI) has transformed the treatment of patients with SNHL, having restored hearing to more than 800,000 people. The success of CIs has inspired multidisciplinary efforts to address the unmet need for personalized, cellular-level diagnosis, and treatment of patients with SNHL. Current limitations include an inability to safely and accurately image at high resolution and biopsy the inner ear, precluding the use of key structural and molecular information during diagnostic and treatment decisions. Furthermore, there remains a lack of pharmacological therapies for hearing loss, which can partially be attributed to challenges associated with new drug development. We highlight advances in diagnostic and therapeutic strategies for SNHL that will help accelerate the push toward precision medicine. In addition, we discuss technological improvements for the CI that will further enhance its functionality for future patients. This report highlights work that was originally presented by Dr. Stankovic as part of the Dr. John Niparko Memorial Lecture during the 2021 American Cochlear Implant Alliance annual meeting.
Collapse
|
45
|
Lee GW, Kim HK. Two-Step Joint Optimization with Auxiliary Loss Function for Noise-Robust Speech Recognition. SENSORS (BASEL, SWITZERLAND) 2022; 22:5381. [PMID: 35891070 PMCID: PMC9324918 DOI: 10.3390/s22145381] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 06/09/2022] [Revised: 07/16/2022] [Accepted: 07/17/2022] [Indexed: 06/15/2023]
Abstract
In this paper, a new two-step joint optimization approach based on the asynchronous subregion optimization method is proposed for training a pipeline model composed of two different models. The first-step processing of the proposed joint optimization approach trains the front-end model only, and the second-step processing trains all the parameters of the combined model together. In the asynchronous subregion optimization method, the first-step processing only supports the goal of the front-end model. However, the first-step processing of the proposed approach works with a new loss function to make the front-end model support the goal of the back-end model. The proposed optimization approach was applied, here, to a pipeline composed of a deep complex convolutional recurrent network (DCCRN)-based speech enhancement model and a conformer-transducer-based ASR model as a front-end and a back-end, respectively. Then, the performance of the proposed two-step joint optimization approach was evaluated on the LibriSpeech automatic speech recognition (ASR) corpus in noisy environments by measuring the character error rate (CER) and word error rate (WER). In addition, an ablation study was carried out to examine the effectiveness of the proposed optimization approach on each of the processing blocks in the conformer-transducer ASR model. Consequently, it was shown from the ablation study that the conformer-transducer-based ASR model with the joint network trained only by the proposed optimization approach achieved the lowest average CER and WER. Moreover, the proposed optimization approach reduced the average CER and WER on the Test-Noisy dataset under matched noise conditions by 0.30% and 0.48%, respectively, compared to the approach of separate optimization of speech enhancement and ASR. Compared to the conventional two-step joint optimization approach, the proposed optimization approach provided average CER and WER reductions of 0.22% and 0.31%, respectively. Moreover, it was revealed that the proposed optimization approach achieved a lower average CER and WER, by 0.32% and 0.43%, respectively, than the conventional optimization approach under mismatched noise conditions.
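A schematic of the two-step joint optimization described above, with placeholder models and loss terms rather than the DCCRN and conformer-transducer implementations: step one trains the enhancement front-end with an auxiliary term that reflects the ASR back-end's objective, and step two fine-tunes both models together.

```python
# Two-step joint optimization sketch with placeholder front-end and back-end.
import torch
import torch.nn as nn

enhancer = nn.Linear(257, 257)        # stands in for the enhancement front-end
asr = nn.Linear(257, 30)              # stands in for the ASR back-end

def enhancement_loss(enh, clean):
    return nn.functional.mse_loss(enh, clean)

def asr_loss(logits, targets):
    return nn.functional.cross_entropy(logits.transpose(1, 2), targets)

noisy, clean = torch.rand(4, 50, 257), torch.rand(4, 50, 257)
targets = torch.randint(0, 30, (4, 50))

# Step 1: train the front-end only, with an auxiliary ASR term in its loss.
opt1 = torch.optim.Adam(enhancer.parameters(), lr=1e-3)
enh = enhancer(noisy)
loss1 = enhancement_loss(enh, clean) + 0.1 * asr_loss(asr(enh), targets)
opt1.zero_grad()
loss1.backward()
opt1.step()

# Step 2: train front-end and back-end together.
opt2 = torch.optim.Adam(list(enhancer.parameters()) + list(asr.parameters()), lr=1e-4)
enh = enhancer(noisy)
loss2 = asr_loss(asr(enh), targets)
opt2.zero_grad()
loss2.backward()
opt2.step()
print(float(loss1), float(loss2))
```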
Collapse
Affiliation(s)
- Geon Woo Lee
- AI Graduate School, Gwangju Institute of Science and Technology, Gwangju 61005, Korea;
| | - Hong Kook Kim
- AI Graduate School, Gwangju Institute of Science and Technology, Gwangju 61005, Korea;
- School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology, Gwangju 61005, Korea
| |
Collapse
|
46
|
A survey on deep reinforcement learning for audio-based applications. Artif Intell Rev 2022. [DOI: 10.1007/s10462-022-10224-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/02/2022]
Abstract
Deep reinforcement learning (DRL) is poised to revolutionise the field of artificial intelligence (AI) by endowing autonomous systems with high levels of understanding of the real world. Currently, deep learning (DL) is enabling DRL to effectively solve various intractable problems in various fields, including computer vision, natural language processing, healthcare, and robotics, to name a few. Most importantly, DRL algorithms are also being employed in audio signal processing to learn directly from speech, music, and other sound signals in order to create audio-based autonomous systems that have many promising applications in the real world. In this article, we conduct a comprehensive survey on the progress of DRL in the audio domain by bringing together research studies across different but related areas in speech and music. We begin with an introduction to the general field of DL and reinforcement learning (RL), then progress to the main DRL methods and their applications in the audio domain. We conclude by presenting important challenges faced by audio-based DRL agents and by highlighting open areas for future research and investigation. The findings of this paper will guide researchers interested in DRL for the audio domain.
Collapse
|
48
|
Grumiaux PA, Kitić S, Girin L, Guérin A. A survey of sound source localization with deep learning methods. THE JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA 2022; 152:107. [PMID: 35931500 DOI: 10.1121/10.0011809] [Citation(s) in RCA: 33] [Impact Index Per Article: 11.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Received: 11/26/2021] [Accepted: 06/06/2022] [Indexed: 06/15/2023]
Abstract
This article is a survey of deep learning methods for single and multiple sound source localization, with a focus on sound source localization in indoor environments, where reverberation and diffuse noise are present. We provide an extensive topography of the neural network-based sound source localization literature in this context, organized according to the neural network architecture, the type of input features, the output strategy (classification or regression), the types of data used for model training and evaluation, and the model training strategy. Tables summarizing the literature survey are provided at the end of the paper, allowing a quick search of methods with a given set of target characteristics.
Collapse
Affiliation(s)
- Pierre-Amaury Grumiaux
- Nantes Université, École Centrale Nantes, CNRS, LS2N, 2 chemin de la Houssinière, F-44332 Nantes, France
| | - Srđan Kitić
- Orange Labs, 4 Rue du Clos Courtel, 35510 Cesson-Sévigné, France
| | - Laurent Girin
- Univ. Grenoble Alpes, Grenoble-INP, GIPSA-lab, 11 Rue des Mathématiques, 38400 Saint-Martin-d'Hères, France
| | - Alexandre Guérin
- Orange Labs, 4 Rue du Clos Courtel, 35510 Cesson-Sévigné, France
| |
Collapse
|
49
|
Huang Y, Hao Y, Xu J, Xu B. Compressing speaker extraction model with ultra-low precision quantization and knowledge distillation. Neural Netw 2022; 154:13-21. [PMID: 35841810 DOI: 10.1016/j.neunet.2022.06.026] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/11/2021] [Revised: 04/20/2022] [Accepted: 06/21/2022] [Indexed: 11/25/2022]
Abstract
Recently, our proposed speaker extraction model, WASE (learning When to Attend for Speaker Extraction), yielded superior performance over prior state-of-the-art methods by explicitly modeling the onset clue and regarding it as important guidance in speaker extraction tasks. However, deployment on resource-constrained devices remains challenging, as the model must be tiny and fast enough to perform inference with a minimal CPU and memory budget while maintaining speaker extraction performance. In this work, we utilize model compression techniques to alleviate this problem and propose a lightweight speaker extraction model, TinyWASE, which aims to run on resource-constrained devices. Specifically, we mainly investigate the grouping effects of quantization-aware training and knowledge distillation techniques in the speaker extraction task and propose Distillation-aware Quantization. Experiments on the WSJ0-2mix dataset show that our proposed model can achieve performance comparable to the full-precision model while reducing the model size using ultra-low-bit quantization (e.g., 3 bits), obtaining an 8.97x compression ratio and a 2.15 MB model size. We further show that TinyWASE can be combined with other model compression techniques, such as parameter sharing, to achieve a compression ratio as high as 23.81 with limited performance degradation. Our code is available at https://github.com/aispeech-lab/TinyWASE.
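A back-of-the-envelope view of the compression numbers quoted above: uniform fake-quantization of weights to k bits, plus the ideal size reduction relative to 32-bit floats. The paper's 8.97x ratio also reflects distillation and implementation overheads, so this is only a rough sanity check.

```python
# Uniform fake-quantization of weights to k bits and the ideal 32-bit -> k-bit
# size ratio, as a rough check on the quoted compression figures.
import numpy as np

def fake_quantize(w, bits=3):
    """Uniformly quantize weights to 2**bits levels over their value range."""
    lo, hi = w.min(), w.max()
    levels = 2 ** bits - 1
    q = np.round((w - lo) / (hi - lo) * levels)
    return lo + q / levels * (hi - lo)

weights = np.random.randn(10000).astype(np.float32)
wq = fake_quantize(weights, bits=3)
print("quantization MSE:", float(np.mean((weights - wq) ** 2)))
print("ideal size ratio 32-bit -> 3-bit:", 32 / 3)   # ~10.7x upper bound
```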
Collapse
Affiliation(s)
- Yating Huang
- Institute of Automation, Chinese Academy of Sciences (CAS), Beijing, China; School of Future Technology, University of Chinese Academy of Sciences, Beijing, China.
| | - Yunzhe Hao
- Institute of Automation, Chinese Academy of Sciences (CAS), Beijing, China; School of Future Technology, University of Chinese Academy of Sciences, Beijing, China.
| | - Jiaming Xu
- Institute of Automation, Chinese Academy of Sciences (CAS), Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China.
| | - Bo Xu
- Institute of Automation, Chinese Academy of Sciences (CAS), Beijing, China; School of Future Technology, University of Chinese Academy of Sciences, Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China; Center for Excellence in Brain Science and Intelligence Technology, CAS, Shanghai, China
| |
Collapse
|
50
|
A Multi-Source Separation Approach Based on DOA Cue and DNN. APPLIED SCIENCES-BASEL 2022. [DOI: 10.3390/app12126224] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/10/2022]
Abstract
Multiple sound source separation in a reverberant environment has become popular in recent years. To improve the quality of the separated signal in a reverberant environment, a separation method based on a DOA cue and a deep neural network (DNN) is proposed in this paper. Firstly, a pre-processing model based on non-negative matrix factorization (NMF) is utilized for recorded signal dereverberation, which makes source separation more efficient. Then, we propose a multi-source separation algorithm combining sparse and non-sparse component points recovery to obtain each sound source signal from the dereverberated signal. For sparse component points, the dominant sound source for each sparse component point is determined by a DOA cue. For non-sparse component points, a DNN is used to recover each sound source signal. Finally, the signals separated from the sparse and non-sparse component points are well matched by temporal correlation to obtain each sound source signal. Both objective and subjective evaluation results indicate that compared with the existing method, the proposed separation approach shows a better performance in the case of a high-reverberation environment.
Collapse
|