Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: Wang D, Chen J. Supervised Speech Separation Based on Deep Learning: An Overview. IEEE/ACM Trans Audio Speech Lang Process 2018;26:1702-1726. [PMID: 31223631 PMCID: PMC6586438 DOI: 10.1109/taslp.2018.2842159] [Citation(s) in RCA: 121] [Impact Index Per Article: 17.3] [Indexed: 05/12/2023]

Cited by Other Article(s)

Ullah R, Zhang S, Asif M, Wahab F. Multimodal learning-based speech enhancement and separation, recent innovations, new horizons, challenges and real-world applications. Comput Biol Med 2025;190:110082. [PMID: 40174498 DOI: 10.1016/j.compbiomed.2025.110082] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/05/2024] [Revised: 01/18/2025] [Accepted: 03/24/2025] [Indexed: 04/04/2025]

Abstract

With the increasing global prevalence of disabling hearing loss, speech enhancement technologies have become crucial for overcoming communication barriers and improving the quality of life for those affected. Multimodal learning has emerged as a powerful approach for speech enhancement and separation, integrating information from various sensory modalities such as audio signals, visual cues, and textual data. Despite substantial progress, challenges remain in synchronizing modalities, ensuring model robustness, and achieving scalability for real-time applications. This paper provides a comprehensive review of the latest advances in the most promising strategy, multimodal learning for speech enhancement and separation. We underscore the limitations of various methods in noisy and dynamic real-world environments and demonstrate how multimodal systems leverage complementary information from lip movements, text transcripts, and even brain signals to enhance performance. Critical deep learning architectures are covered, such as Transformers, Convolutional Neural Networks (CNNs), Graph Neural Networks (GNNs), and generative models like Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), and Diffusion Models. Various fusion strategies, including early and late fusion and attention mechanisms, are explored to address challenges in aligning and integrating multimodal inputs effectively. Furthermore, the paper explores important real-world applications in areas like automatic driver monitoring in autonomous vehicles, emotion recognition for mental health monitoring, augmented reality in interactive retail, smart surveillance for public safety, remote healthcare and telemedicine, and hearing assistive devices. Additionally, critical advanced procedures, comparisons, future challenges, and prospects are discussed to guide future research in multimodal learning for speech enhancement and separation, offering a roadmap for new horizons in this transformative field.

Wang ZQ. SuperM2M: Supervised and mixture-to-mixture co-learning for speech enhancement and noise-robust ASR. Neural Netw 2025;188:107408. [PMID: 40157231 DOI: 10.1016/j.neunet.2025.107408] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/15/2024] [Revised: 01/31/2025] [Accepted: 03/13/2025] [Indexed: 04/01/2025]

Huang K, Liu M, Ma S. Nearly Optimal Learning Using Sparse Deep ReLU Networks in Regularized Empirical Risk Minimization With Lipschitz Loss. Neural Comput 2025;37:815-870. [PMID: 40030138 DOI: 10.1162/neco_a_01742] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/09/2024] [Accepted: 11/27/2024] [Indexed: 03/19/2025]

Im J, Pak S, Woo SY, Shin W, Lee ST. Flash Memory for Synaptic Plasticity in Neuromorphic Computing: A Review. Biomimetics (Basel) 2025;10:121. [PMID: 39997144 PMCID: PMC11852767 DOI: 10.3390/biomimetics10020121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/20/2024] [Revised: 01/17/2025] [Accepted: 01/21/2025] [Indexed: 02/26/2025] Open

Vinotha R, Hepsiba D, Vijay Anand LD, Andrew J, Jennifer Eunice R. Enhancing dysarthric speech recognition through SepFormer and hierarchical attention network models with multistage transfer learning. Sci Rep 2024;14:29455. [PMID: 39604526 PMCID: PMC11603152 DOI: 10.1038/s41598-024-80764-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/09/2024] [Accepted: 11/21/2024] [Indexed: 11/29/2024] Open

Abstract

Dysarthria, a motor speech disorder that impacts articulation and speech clarity, presents significant challenges for Automatic Speech Recognition (ASR) systems. This study proposes a groundbreaking approach to enhance the accuracy of Dysarthric Speech Recognition (DSR). A primary innovation lies in the integration of the SepFormer-Speech Enhancement Generative Adversarial Network (S-SEGAN), an advanced generative adversarial network tailored for Dysarthric Speech Enhancement (DSE), as a front-end processing stage for DSR systems. The S-SEGAN integrates SEGAN's adversarial learning with SepFormer speech separation capabilities, demonstrating significant improvements in performance. Furthermore, a multistage transfer learning approach is employed to assess the DSR models for both word-level and sentence-level DSR. These DSR models are first trained on a large speech dataset (LibriSpeech) and then fine-tuned on dysarthric speech data (both isolated and augmented). Evaluations demonstrate significant DSR accuracy improvements in DSE integration. The Dysarthric Speech (DS)-baseline models (without DSE), Transformer and Conformer achieved Word Recognition Accuracy (WRA) percentages of 68.60% and 69.87%, respectively. The introduction of Hierarchical Attention Network (HAN) with the Transformer and Conformer architectures resulted in improved performance, with T-HAN achieving a WRA of 71.07% and C-HAN reaching 73%. The Transformer model with DSE + DSR for isolated words achieves a WRA of 73.40%, while that of the Conformer model reaches 74.33%. Notably, the T-HAN and C-HAN models with DSE + DSR demonstrate even more substantial enhancements, with WRAs of 75.73% and 76.87%, respectively. Augmenting words further boosts model performance, with the Transformer and Conformer models achieving WRAs of 76.47% and 79.20%, respectively. Remarkably, the T-HAN and C-HAN models with DSE + DSR and augmented words exhibit WRAs of 82.13% and 84.07%, respectively, with C-HAN displaying the highest performance among all proposed models.

Henry F, Glavin M, Jones E, Parsi A. Impact of Mask Type as Training Target for Speech Intelligibility and Quality in Cochlear-Implant Noise Reduction. Sensors (Basel, Switzerland) 2024;24:6614. [PMID: 39460094 PMCID: PMC11511210 DOI: 10.3390/s24206614] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/23/2024] [Revised: 10/06/2024] [Accepted: 10/10/2024] [Indexed: 10/28/2024]

Han JY, Li JH, Yang CS, Chen F, Liao WH, Liao YF, Lai YH. Leveraging Deep Learning to Enhance Optical Microphone System Performance with Unknown Speakers for Cochlear Implants. Annual International Conference of the IEEE Engineering in Medicine and Biology Society 2024;2024:1-6. [PMID: 40039183 DOI: 10.1109/embc53108.2024.10782084] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/06/2025]

Lee GW, Kim HK. Cluster-Based Pairwise Contrastive Loss for Noise-Robust Speech Recognition. Sensors (Basel, Switzerland) 2024;24:2573. [PMID: 38676191 PMCID: PMC11054889 DOI: 10.3390/s24082573] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/05/2024] [Revised: 04/08/2024] [Accepted: 04/16/2024] [Indexed: 04/28/2024]

Fan J, Williamson DS. From the perspective of perceptual speech quality: The robustness of frequency bands to noise. The Journal of the Acoustical Society of America 2024;155:1916-1927. [PMID: 38456734 DOI: 10.1121/10.0025272] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/30/2023] [Accepted: 02/22/2024] [Indexed: 03/09/2024]

Cherukuru P, Mustafa MB. CNN-based noise reduction for multi-channel speech enhancement system with discrete wavelet transform (DWT) preprocessing. PeerJ Comput Sci 2024;10:e1901. [PMID: 38435554 PMCID: PMC10909157 DOI: 10.7717/peerj-cs.1901] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/10/2023] [Accepted: 01/31/2024] [Indexed: 03/05/2024]

Abstract

Speech enhancement algorithms are applied in multiple levels of enhancement to improve the quality of speech signals under noisy environments known as multi-channel speech enhancement (MCSE) systems. Numerous existing algorithms are used to filter noise in speech enhancement systems, which are typically employed as a pre-processor to reduce noise and improve speech quality. They may, however, be limited in performing well under low signal-to-noise ratio (SNR) situations. The speech devices are exposed to all kinds of environmental noises which may go up to a high-level frequency of noises. The objective of this research is to conduct a noise reduction experiment for a multi-channel speech enhancement (MCSE) system in stationary and non-stationary environmental noisy situations with varying speech signal SNR levels. The experiments examined the performance of the existing and the proposed MCSE systems for environmental noises in filtering low to high SNRs environmental noises (-10 dB to 20 dB). The experiments were conducted using the AURORA and LibriSpeech datasets, which consist of different types of environmental noises. The existing MCSE (BAV-MCSE) makes use of beamforming, adaptive noise reduction and voice activity detection algorithms (BAV) to filter the noises from speech signals. The proposed MCSE (DWT-CNN-MCSE) system was developed based on discrete wavelet transform (DWT) preprocessing and convolution neural network (CNN) for denoising the input noisy speech signals to improve the performance accuracy. The performance of the existing BAV-MCSE and the proposed DWT-CNN-MCSE were measured using spectrogram analysis and word recognition rate (WRR). It was identified that the existing BAV-MCSE reported the highest WRR at 93.77% for a high SNR (at 20 dB) and 5.64% on average for a low SNR (at -10 dB) for different noises. The proposed DWT-CNN-MCSE system has proven to perform well at a low SNR with WRR of 70.55% and the highest improvement (64.91% WRR) at -10 dB SNR.

Kim S, Athi M, Shi G, Kim M, Kristjansson T. Zero-shot test time adaptation via knowledge distillation for personalized speech denoising and dereverberation. The Journal of the Acoustical Society of America 2024;155:1353-1367. [PMID: 38364043 DOI: 10.1121/10.0024621] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/27/2023] [Accepted: 01/11/2024] [Indexed: 02/18/2024]

Wang D, Wang J, Sun M. 3 directional Inception-ResUNet: Deep spatial feature learning for multichannel singing voice separation with distortion. PLoS One 2024;19:e0289453. [PMID: 38285654 PMCID: PMC10824424 DOI: 10.1371/journal.pone.0289453] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/18/2023] [Accepted: 01/04/2024] [Indexed: 01/31/2024] Open

Schuller BW, Akman A, Chang Y, Coppock H, Gebhard A, Kathan A, Rituerto-González E, Triantafyllopoulos A, Pokorny FB. Ecology & computer audition: Applications of audio technology to monitor organisms and environment. Heliyon 2024;10:e23142. [PMID: 38163154 PMCID: PMC10755287 DOI: 10.1016/j.heliyon.2023.e23142] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/22/2023] [Revised: 11/08/2023] [Accepted: 11/27/2023] [Indexed: 01/03/2024] Open

Luo X, Ke Y, Li X, Zheng C. On phase recovery and preserving early reflections for deep-learning speech dereverberation. The Journal of the Acoustical Society of America 2024;155:436-451. [PMID: 38240664 DOI: 10.1121/10.0024348] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/13/2023] [Accepted: 12/21/2023] [Indexed: 01/23/2024]

Ha J, Baek SC, Lim Y, Chung JH. Validation of cost-efficient EEG experimental setup for neural tracking in an auditory attention task. Sci Rep 2023;13:22682. [PMID: 38114579 PMCID: PMC10730561 DOI: 10.1038/s41598-023-49990-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/16/2023] [Accepted: 12/14/2023] [Indexed: 12/21/2023] Open

Lin YM, Han JY, Lin CH, Lai YH. Optical Microphone-Based Speech Reconstruction System With Deep Learning for Individuals With Hearing Loss. IEEE Trans Biomed Eng 2023;70:3330-3341. [PMID: 37327105 DOI: 10.1109/tbme.2023.3285437] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/18/2023]

Fan C, Zhang H, Li A, Xiang W, Zheng C, Lv Z, Wu X. CompNet: Complementary network for single-channel speech enhancement. Neural Netw 2023;168:508-517. [PMID: 37832318 DOI: 10.1016/j.neunet.2023.09.041] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/05/2023] [Revised: 08/18/2023] [Accepted: 09/24/2023] [Indexed: 10/15/2023]

Hou Z, Hu Q, Chen K, Cao Z, Lu J. Local spectral attention for full-band speech enhancement. JASA Express Letters 2023;3:115201. [PMID: 37916951 DOI: 10.1121/10.0022268] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/31/2023] [Accepted: 10/19/2023] [Indexed: 11/03/2023]

Li G, Fu M, Sun M, Liu X, Zheng B. A Facial Feature and Lip Movement Enhanced Audio-Visual Speech Separation Model. Sensors (Basel, Switzerland) 2023;23:8770. [PMID: 37960477 PMCID: PMC10647675 DOI: 10.3390/s23218770] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/12/2023] [Revised: 10/24/2023] [Accepted: 10/25/2023] [Indexed: 11/15/2023]

Triantafyllopoulos A, Kathan A, Baird A, Christ L, Gebhard A, Gerczuk M, Karas V, Hübner T, Jing X, Liu S, Mallol-Ragolta A, Milling M, Ottl S, Semertzidou A, Rajamani ST, Yan T, Yang Z, Dineley J, Amiriparian S, Bartl-Pokorny KD, Batliner A, Pokorny FB, Schuller BW. HEAR4Health: a blueprint for making computer audition a staple of modern healthcare. Front Digit Health 2023;5:1196079. [PMID: 37767523 PMCID: PMC10520966 DOI: 10.3389/fdgth.2023.1196079] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/29/2023] [Accepted: 09/01/2023] [Indexed: 09/29/2023] Open

Affiliation(s)

Andreas Triantafyllopoulos EIHW – Chair of Embedded Intelligence for Healthcare and Wellbeing, University of Augsburg, Augsburg, Germany
Alexander Kathan EIHW – Chair of Embedded Intelligence for Healthcare and Wellbeing, University of Augsburg, Augsburg, Germany
Alice Baird EIHW – Chair of Embedded Intelligence for Healthcare and Wellbeing, University of Augsburg, Augsburg, Germany
Lukas Christ EIHW – Chair of Embedded Intelligence for Healthcare and Wellbeing, University of Augsburg, Augsburg, Germany
Alexander Gebhard EIHW – Chair of Embedded Intelligence for Healthcare and Wellbeing, University of Augsburg, Augsburg, Germany
Maurice Gerczuk EIHW – Chair of Embedded Intelligence for Healthcare and Wellbeing, University of Augsburg, Augsburg, Germany
Vincent Karas EIHW – Chair of Embedded Intelligence for Healthcare and Wellbeing, University of Augsburg, Augsburg, Germany
Tobias Hübner EIHW – Chair of Embedded Intelligence for Healthcare and Wellbeing, University of Augsburg, Augsburg, Germany
Xin Jing EIHW – Chair of Embedded Intelligence for Healthcare and Wellbeing, University of Augsburg, Augsburg, Germany
Shuo Liu EIHW – Chair of Embedded Intelligence for Healthcare and Wellbeing, University of Augsburg, Augsburg, Germany
Adria Mallol-Ragolta EIHW – Chair of Embedded Intelligence for Healthcare and Wellbeing, University of Augsburg, Augsburg, Germany; Centre for Interdisciplinary Health Research, University of Augsburg, Augsburg, Germany
Manuel Milling EIHW – Chair of Embedded Intelligence for Healthcare and Wellbeing, University of Augsburg, Augsburg, Germany
Sandra Ottl EIHW – Chair of Embedded Intelligence for Healthcare and Wellbeing, University of Augsburg, Augsburg, Germany
Anastasia Semertzidou EIHW – Chair of Embedded Intelligence for Healthcare and Wellbeing, University of Augsburg, Augsburg, Germany
Srividya Tirunellai Rajamani EIHW – Chair of Embedded Intelligence for Healthcare and Wellbeing, University of Augsburg, Augsburg, Germany
Tianhao Yan EIHW – Chair of Embedded Intelligence for Healthcare and Wellbeing, University of Augsburg, Augsburg, Germany
Zijiang Yang EIHW – Chair of Embedded Intelligence for Healthcare and Wellbeing, University of Augsburg, Augsburg, Germany
Judith Dineley EIHW – Chair of Embedded Intelligence for Healthcare and Wellbeing, University of Augsburg, Augsburg, Germany
Shahin Amiriparian EIHW – Chair of Embedded Intelligence for Healthcare and Wellbeing, University of Augsburg, Augsburg, Germany
Katrin D. Bartl-Pokorny EIHW – Chair of Embedded Intelligence for Healthcare and Wellbeing, University of Augsburg, Augsburg, Germany; Division of Phoniatrics, Medical University of Graz, Graz, Austria
Anton Batliner EIHW – Chair of Embedded Intelligence for Healthcare and Wellbeing, University of Augsburg, Augsburg, Germany
Florian B. Pokorny EIHW – Chair of Embedded Intelligence for Healthcare and Wellbeing, University of Augsburg, Augsburg, Germany; Division of Phoniatrics, Medical University of Graz, Graz, Austria; Centre for Interdisciplinary Health Research, University of Augsburg, Augsburg, Germany
Björn W. Schuller EIHW – Chair of Embedded Intelligence for Healthcare and Wellbeing, University of Augsburg, Augsburg, Germany; Centre for Interdisciplinary Health Research, University of Augsburg, Augsburg, Germany; GLAM – Group on Language, Audio, & Music, Imperial College London, London, United Kingdom

Henry F, Parsi A, Glavin M, Jones E. Experimental Investigation of Acoustic Features to Optimize Intelligibility in Cochlear Implants. Sensors (Basel, Switzerland) 2023;23:7553. [PMID: 37688009 PMCID: PMC10490615 DOI: 10.3390/s23177553] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/09/2023] [Revised: 08/21/2023] [Accepted: 08/28/2023] [Indexed: 09/10/2023]

Yang Y, Pandey A, Wang D. Time-Domain Speech Enhancement for Robust Automatic Speech Recognition. Interspeech 2023;2023:4913-4917. [PMID: 40313476 PMCID: PMC12045131 DOI: 10.21437/interspeech.2023-167] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/03/2025]

Peracha FK, Khattak MI, Salem N, Saleem N. Causal speech enhancement using dynamical-weighted loss and attention encoder-decoder recurrent neural network. PLoS One 2023;18:e0285629. [PMID: 37167227 PMCID: PMC10174555 DOI: 10.1371/journal.pone.0285629] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 11/28/2022] [Accepted: 04/26/2023] [Indexed: 05/13/2023] Open

Bellur A, Thakkar K, Elhilali M. Explicit-memory multiresolution adaptive framework for speech and music separation. EURASIP Journal on Audio, Speech, and Music Processing 2023;2023:20. [PMID: 37181589 PMCID: PMC10169896 DOI: 10.1186/s13636-023-00286-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/23/2022] [Accepted: 04/21/2023] [Indexed: 05/16/2023]

Healy EW, Johnson EM, Pandey A, Wang D. Progress made in the efficacy and viability of deep-learning-based noise reduction. The Journal of the Acoustical Society of America 2023;153:2751. [PMID: 37133814 PMCID: PMC10159658 DOI: 10.1121/10.0019341] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 08/06/2022] [Revised: 04/17/2023] [Accepted: 04/17/2023] [Indexed: 05/04/2023]

Rascon C. Characterization of Deep Learning-Based Speech-Enhancement Techniques in Online Audio Processing Applications. Sensors (Basel, Switzerland) 2023;23:4394. [PMID: 37177598 PMCID: PMC10181690 DOI: 10.3390/s23094394] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/07/2023] [Revised: 04/24/2023] [Accepted: 04/28/2023] [Indexed: 05/15/2023]

Chen H, Zhang X. CGA-MGAN: Metric GAN Based on Convolution-Augmented Gated Attention for Speech Enhancement. Entropy (Basel, Switzerland) 2023;25:628. [PMID: 37190416 PMCID: PMC10137386 DOI: 10.3390/e25040628] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/09/2023] [Revised: 03/15/2023] [Accepted: 04/04/2023] [Indexed: 05/17/2023]

Pandey A, Wang D. Attentive Training: A New Training Framework for Speech Enhancement. IEEE/ACM Transactions on Audio, Speech, and Language Processing 2023;31:1360-1370. [PMID: 37899765 PMCID: PMC10602021 DOI: 10.1109/taslp.2023.3260711] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/31/2023]

Li F, Hu Y, Wang L. Unsupervised Single-Channel Singing Voice Separation with Weighted Robust Principal Component Analysis Based on Gammatone Auditory Filterbank and Vocal Activity Detection. Sensors (Basel, Switzerland) 2023;23:3015. [PMID: 36991724 PMCID: PMC10056690 DOI: 10.3390/s23063015] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/04/2023] [Revised: 02/27/2023] [Accepted: 03/09/2023] [Indexed: 06/19/2023]

Hu Q, Hou Z, Chen K, Lu J. Learnable spectral dimension compression mapping for full-band speech enhancement. JASA Express Letters 2023;3:025204. [PMID: 36858985 DOI: 10.1121/10.0017327] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 06/18/2023]

Drgas S. A Survey on Low-Latency DNN-Based Speech Enhancement. Sensors (Basel, Switzerland) 2023;23:1380. [PMID: 36772421 PMCID: PMC9921748 DOI: 10.3390/s23031380] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/10/2022] [Revised: 01/19/2023] [Accepted: 01/23/2023] [Indexed: 06/18/2023]

Zheng C, Zhang H, Liu W, Luo X, Li A, Li X, Moore BCJ. Sixty Years of Frequency-Domain Monaural Speech Enhancement: From Traditional to Deep Learning Methods. Trends Hear 2023;27:23312165231209913. [PMID: 37956661 PMCID: PMC10658184 DOI: 10.1177/23312165231209913] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 12/02/2022] [Accepted: 10/09/2023] [Indexed: 11/15/2023] Open

Liu TH, Chi JZ, Wu BL, Chen YS, Huang CH, Chu YS. Design and Implementation of Machine Tool Life Inspection System Based on Sound Sensing. Sensors (Basel, Switzerland) 2022;23:284. [PMID: 36616882 PMCID: PMC9823646 DOI: 10.3390/s23010284] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/20/2022] [Revised: 12/15/2022] [Accepted: 12/23/2022] [Indexed: 06/17/2023]

Liu S, Mallol-Ragolta A, Parada-Cabaleiro E, Qian K, Jing X, Kathan A, Hu B, Schuller BW. Audio self-supervised learning: A survey. Patterns (New York, N.Y.) 2022;3:100616. [PMID: 36569546 PMCID: PMC9768631 DOI: 10.1016/j.patter.2022.100616] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/13/2022]

Hepsiba D, Justin J. Enhancement of single channel speech quality and intelligibility in multiple noise conditions using wiener filter and deep CNN. Soft Comput 2022. [DOI: 10.1007/s00500-021-06291-2] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 11/24/2022]

Multichannel Variational Autoencoder-Based Speech Separation in Designated Speaker Order. Symmetry (Basel) 2022. [DOI: 10.3390/sym14122514] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2022] Open

Multichannel KHMF for speech separation with enthalpy based DOA and score based CNN (SCNN). Evolving Systems 2022. [DOI: 10.1007/s12530-022-09473-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/17/2022]

Šarić Z, Subotić M, Bilibajkić R, Barjaktarović M, Stojanović J. Supervised speech separation combined with adaptive beamforming. Comput Speech Lang 2022. [DOI: 10.1016/j.csl.2022.101409] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022]

Shi J, Chang X, Watanabe S, Xu B. Train from scratch: Single-stage joint training of speech separation and recognition. Comput Speech Lang 2022. [DOI: 10.1016/j.csl.2022.101387] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2022]

Zhong D, Hu Y, Zhao K, Deng W, Hou P, Zhang J. Accurate separation of mixed high-dimension optical-chaotic signals using optical reservoir computing based on optically pumped VCSELs. Optics Express 2022;30:39561-39581. [PMID: 36298905 DOI: 10.1364/oe.470857] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/20/2022] [Accepted: 09/28/2022] [Indexed: 06/16/2023]

Abstract

In this work, with the mixing fractions being known in advance or unknown, the schemes and theories for the separations of two groups of the mixed optical chaotic signals are proposed in detail, using the VCSEL-based reservoir computing (RC) systems. Here, two groups of the mixed optical chaotic signals are linearly combined with many beams of the chaotic x-polarization components (X-PCs) and Y-PCs emitted by the optically pumped spin-VCSELs operation alone. Two parallel reservoirs are performed by using the chaotic X-PC and Y-PC output by the optically pumped spin-VCSEL with both optical feedback and optical injection. Moreover, we further demonstrate the separation performances of the mixed chaotic signal linearly combined with no more than three beams of the chaotic X-PC or Y-PC. We find that two groups of the mixed optical chaos signals can be effectively separated by using two reservoirs in a single RC system based on optically pumped Spin-VCSEL and their corresponding separated errors characterized by the training errors are no more than 0.093, when the mixing fractions are known as a certain value in advance. If the mixing fractions are unknown, we utilize two cascaded RC systems based on optically pumped Spin-VCSELs to separate each group of the mixed optical signals. The mixing fractions can be accurately predicted by using two parallel reservoirs in the first RC system. Based on the values of the predictive mixing fractions, two groups of the mixed optical chaos signals can be effectively separated by utilizing two parallel reservoirs in the second RC system, and their separated errors also are no more than 0.093. In the same way, the mixed optical chaos signal linearly superimposed with more than three beams of optical chaotic signals can be effectively separated. The method and idea for separation of complex optical chaos signals proposed by this paper may provide an impact to the development of novel principles of multiple access and demultiplexing in multi-channel chaotic cryptography communication.

Chou KF, Boyd AD, Best V, Colburn HS, Sen K. A biologically oriented algorithm for spatial sound segregation. Front Neurosci 2022;16:1004071. [PMID: 36312015 PMCID: PMC9614053 DOI: 10.3389/fnins.2022.1004071] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 07/26/2022] [Accepted: 09/28/2022] [Indexed: 11/13/2022] Open

Wang H, Zhang X, Wang D. Fusing Bone-conduction and Air-conduction Sensors for Complex-Domain Speech Enhancement. IEEE/ACM Trans Audio Speech Lang Process 2022;30:3134-3143. [PMID: 37124143 PMCID: PMC10147322 DOI: 10.1109/taslp.2022.3209943] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/03/2023]

Zhang K, Liu T, Song S, Zhao X, Sun S, Metzner W, Feng J, Liu Y. Separating overlapping bat calls with a bi-directional long short-term memory network. Integr Zool 2022;17:741-751. [PMID: 33881210 DOI: 10.1111/1749-4877.12549] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/26/2022]

Toward Personalized Diagnosis and Therapy for Hearing Loss: Insights From Cochlear Implants. Otol Neurotol 2022;43:e903-e909. [PMID: 35970169 DOI: 10.1097/mao.0000000000003624] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/25/2022]

Lee GW, Kim HK. Two-Step Joint Optimization with Auxiliary Loss Function for Noise-Robust Speech Recognition. Sensors (Basel) 2022;22:5381. [PMID: 35891070 PMCID: PMC9324918 DOI: 10.3390/s22145381] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 06/09/2022] [Revised: 07/16/2022] [Accepted: 07/17/2022] [Indexed: 06/15/2023]

Abstract

In this paper, a new two-step joint optimization approach based on the asynchronous subregion optimization method is proposed for training a pipeline model composed of two different models. The first-step processing of the proposed joint optimization approach trains the front-end model only, and the second-step processing trains all the parameters of the combined model together. In the asynchronous subregion optimization method, the first-step processing only supports the goal of the front-end model. However, the first-step processing of the proposed approach works with a new loss function to make the front-end model support the goal of the back-end model. The proposed optimization approach was applied, here, to a pipeline composed of a deep complex convolutional recurrent network (DCCRN)-based speech enhancement model and a conformer-transducer-based ASR model as a front-end and a back-end, respectively. Then, the performance of the proposed two-step joint optimization approach was evaluated on the LibriSpeech automatic speech recognition (ASR) corpus in noisy environments by measuring the character error rate (CER) and word error rate (WER). In addition, an ablation study was carried out to examine the effectiveness of the proposed optimization approach on each of the processing blocks in the conformer-transducer ASR model. Consequently, it was shown from the ablation study that the conformer-transducer-based ASR model with the joint network trained only by the proposed optimization approach achieved the lowest average CER and WER. Moreover, the proposed optimization approach reduced the average CER and WER on the Test-Noisy dataset under matched noise conditions by 0.30% and 0.48%, respectively, compared to the approach of separate optimization of speech enhancement and ASR. Compared to the conventional two-step joint optimization approach, the proposed optimization approach provided average CER and WER reductions of 0.22% and 0.31%, respectively. Moreover, it was revealed that the proposed optimization approach achieved a lower average CER and WER, by 0.32% and 0.43%, respectively, than the conventional optimization approach under mismatched noise conditions.

A survey on deep reinforcement learning for audio-based applications. Artif Intell Rev 2022. [DOI: 10.1007/s10462-022-10224-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/02/2022]

HyVulDect: A Hybrid Semantic Vulnerability Mining System Based on Graph Neural Network. Comput Secur 2022. [DOI: 10.1016/j.cose.2022.102823] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/21/2022]

Grumiaux PA, Kitić S, Girin L, Guérin A. A survey of sound source localization with deep learning methods. J Acoust Soc Am 2022;152:107. [PMID: 35931500 DOI: 10.1121/10.0011809] [Citation(s) in RCA: 33] [Impact Index Per Article: 11.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Received: 11/26/2021] [Accepted: 06/06/2022] [Indexed: 06/15/2023]

Huang Y, Hao Y, Xu J, Xu B. Compressing speaker extraction model with ultra-low precision quantization and knowledge distillation. Neural Netw 2022;154:13-21. [PMID: 35841810 DOI: 10.1016/j.neunet.2022.06.026] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/11/2021] [Revised: 04/20/2022] [Accepted: 06/21/2022] [Indexed: 11/25/2022]

A Multi-Source Separation Approach Based on DOA Cue and DNN. Appl Sci (Basel) 2022. [DOI: 10.3390/app12126224] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/10/2022]