51
|
Hao Y, Wu J, Huang X, Zhang Z, Liu F, Wu Q. Speaker extraction network with attention mechanism for speech dialogue system. Service Oriented Computing and Applications 2022. [DOI: 10.1007/s11761-022-00340-w]
|
52
|
Telepresence Social Robotics towards Co-Presence: A Review. Applied Sciences (Basel) 2022. [DOI: 10.3390/app12115557]
Abstract
Telepresence robots are becoming popular in social interactions involving health care, elderly assistance, guidance, or office meetings. There are two types of human psychological experiences to consider in robot-mediated interactions: (1) telepresence, in which a user develops a sense of being present near the remote interlocutor, and (2) co-presence, in which a user perceives the other person as being present locally with him or her. This work presents a literature review on developments supporting robotic social interactions, contributing to improving the sense of presence and co-presence via robot mediation. This survey aims to define social presence and co-presence, identify autonomous “user-adaptive systems” for social robots, and propose a taxonomy for “co-presence” mechanisms. It presents an overview of social robotics systems, application areas, and technical methods, and provides directions for telepresence and co-presence robot design given current and future challenges. Finally, we suggest evaluation guidelines for these systems, taking face-to-face interaction as the reference.
|
53
|
Abstract
Most deep-learning-based multi-channel speech enhancement methods focus on designing a set of beamforming coefficients to directly filter the low signal-to-noise-ratio signals received by the microphones, which limits the performance of these approaches. To address this problem, this paper designs a causal neural filter that fully exploits the spectro-temporal-spatial information in the beamspace domain. Specifically, in the first stage, multiple beams are designed to steer toward all directions using a parameterized super-directive beamformer. In the second stage, a deep-learning-based filter is learned by simultaneously modeling the spectro-temporal-spatial discriminability of the speech and the interference, so as to coarsely extract the desired speech. Finally, to further suppress the interference components, especially at low frequencies, a residual estimation module refines the output of the second stage. Experimental results demonstrate that the proposed approach outperforms many state-of-the-art (SOTA) multi-channel methods on a multi-channel speech dataset generated from the DNS-Challenge dataset.
|
54
|
Wang H, Zhang X, Wang D. Attention-based fusion for bone-conducted and air-conducted speech enhancement in the complex domain. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2022; 2022:7757-7761. [PMID: 40313328] [PMCID: PMC12045135] [DOI: 10.1109/icassp43922.2022.9746374]
Abstract
Bone-conduction (BC) microphones capture speech signals by converting the vibrations of the human skull into electrical signals. BC sensors are insensitive to acoustic noise, but limited in bandwidth. On the other hand, conventional or air-conduction (AC) microphones are capable of capturing full-band speech, but are susceptible to background noise. We propose to combine the strengths of AC and BC microphones by employing a convolutional recurrent network that performs complex spectral mapping. To better utilize signals from both kinds of microphone, we employ attention-based fusion with early-fusion and late-fusion strategies. Experiments demonstrate the superiority of the proposed method over other recent speech enhancement methods combining BC and AC signals. In addition, our enhancement performance is significantly better than conventional speech enhancement counterparts, especially in low signal-to-noise ratio scenarios.
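As an illustration of the attention-based fusion idea described above (not the authors' implementation; channel sizes and layer choices are assumptions), the following PyTorch sketch computes per-time-frequency weights from the concatenated air-conducted and bone-conducted feature maps and forms a convex combination of the two streams:

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Fuse air-conducted (AC) and bone-conducted (BC) feature maps with learned attention weights."""
    def __init__(self, channels: int):
        super().__init__()
        # Score each stream from the concatenated features (one score per stream per T-F unit).
        self.score = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.ReLU(),
            nn.Conv2d(channels, 2, kernel_size=1),
        )

    def forward(self, ac_feat: torch.Tensor, bc_feat: torch.Tensor) -> torch.Tensor:
        # ac_feat, bc_feat: (batch, channels, time, freq)
        scores = self.score(torch.cat([ac_feat, bc_feat], dim=1))     # (B, 2, T, F)
        weights = torch.softmax(scores, dim=1)                        # convex combination per T-F unit
        return weights[:, :1] * ac_feat + weights[:, 1:] * bc_feat    # (B, C, T, F)

# Example: fuse two 16-channel encoder outputs of matching shape.
fusion = AttentionFusion(channels=16)
ac = torch.randn(4, 16, 100, 161)
bc = torch.randn(4, 16, 100, 161)
print(fusion(ac, bc).shape)  # torch.Size([4, 16, 100, 161])
```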
Affiliation(s)
- Heming Wang: Department of Computer Science and Engineering, The Ohio State University, USA
- Xueliang Zhang: Department of Computer Science, Inner Mongolia University, China
- DeLiang Wang: Department of Computer Science and Engineering, The Ohio State University, USA; Center for Cognitive and Brain Sciences, The Ohio State University, USA
|
55
|
Chi NA, Washington P, Kline A, Husic A, Hou C, He C, Dunlap K, Wall DP. Classifying Autism From Crowdsourced Semistructured Speech Recordings: Machine Learning Model Comparison Study. JMIR Pediatr Parent 2022; 5:e35406. [PMID: 35436234] [PMCID: PMC9052034] [DOI: 10.2196/35406]
Abstract
BACKGROUND: Autism spectrum disorder (ASD) is a neurodevelopmental disorder that results in altered behavior, social development, and communication patterns. In recent years, autism prevalence has tripled, with 1 in 44 children now affected. Given that traditional diagnosis is a lengthy, labor-intensive process that requires the work of trained physicians, significant attention has been given to developing systems that automatically detect autism. We work toward this goal by analyzing audio data, as prosody abnormalities are a signal of autism, with affected children displaying speech idiosyncrasies such as echolalia, monotonous intonation, atypical pitch, and irregular linguistic stress patterns.
OBJECTIVE: We aimed to test the ability of machine learning approaches to aid in the detection of autism in self-recorded speech audio captured from children with ASD and neurotypical (NT) children in their home environments.
METHODS: We considered three methods to detect autism in child speech: (1) random forests trained on extracted audio features (including Mel-frequency cepstral coefficients); (2) convolutional neural networks trained on spectrograms; and (3) fine-tuned wav2vec 2.0, a state-of-the-art transformer-based speech recognition model. We trained our classifiers on our novel data set of cellphone-recorded child speech audio curated from the Guess What? mobile game, an app designed to crowdsource videos of children with ASD and NT children in a natural home environment.
RESULTS: When classifying children's audio as either ASD or NT, the random forest classifier achieved 70% accuracy, the fine-tuned wav2vec 2.0 model achieved 77% accuracy, and the convolutional neural network achieved 79% accuracy. We used 5-fold cross-validation to evaluate model performance.
CONCLUSIONS: Our models were able to predict autism status when trained on a varied selection of home audio clips with inconsistent recording qualities, which may be more representative of real-world conditions. The results demonstrate that machine learning methods offer promise in detecting autism automatically from speech without specialized equipment.
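A minimal sketch of the first of the three methods (a random forest on MFCC statistics with 5-fold cross-validation), using stand-in data rather than the Guess What? recordings; the clip-level feature summary and hyperparameters are illustrative assumptions, not the authors' pipeline:

```python
import numpy as np
import librosa
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def mfcc_features(y: np.ndarray, sr: int = 16000, n_mfcc: int = 40) -> np.ndarray:
    """Summarize a clip by the mean and std of its MFCCs (one fixed-length vector per clip)."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)      # (n_mfcc, frames)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# Stand-in data: 40 one-second clips of noise with random labels (1 = ASD, 0 = NT).
rng = np.random.default_rng(0)
clips = [rng.standard_normal(16000).astype(np.float32) for _ in range(40)]
labels = rng.integers(0, 2, size=40)

X = np.stack([mfcc_features(y) for y in clips])
clf = RandomForestClassifier(n_estimators=300, random_state=0)
print("5-fold CV accuracy:", cross_val_score(clf, X, labels, cv=5).mean())
```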
Affiliation(s)
- Nathan A Chi: Division of Systems Medicine, Department of Pediatrics, Stanford University, Palo Alto, CA, United States
- Peter Washington: Department of Bioengineering, Stanford University, Stanford, CA, United States
- Aaron Kline: Division of Systems Medicine, Department of Pediatrics, Stanford University, Palo Alto, CA, United States
- Arman Husic: Division of Systems Medicine, Department of Pediatrics, Stanford University, Palo Alto, CA, United States
- Cathy Hou: Department of Computer Science, Stanford University, Stanford, CA, United States
- Chloe He: Department of Biomedical Data Science, Stanford University, Stanford, CA, United States
- Kaitlyn Dunlap: Division of Systems Medicine, Department of Pediatrics, Stanford University, Palo Alto, CA, United States
- Dennis P Wall: Division of Systems Medicine, Department of Pediatrics, Stanford University, Palo Alto, CA, United States; Department of Biomedical Data Science, Stanford University, Stanford, CA, United States; Department of Psychiatry and Behavioral Sciences, Stanford University, Stanford, CA, United States
|
56
|
Mehra FP, Verma SDSK. BERIS: An mBERT Based Emotion Recognition Algorithm from Indian Speech. ACM Transactions on Asian and Low-Resource Language Information Processing 2022. [DOI: 10.1145/3517195]
Abstract
Emotions, the building blocks of the human intellect, play a vital role in artificial intelligence. For a robust AI-based machine, it is important that the machine understands human emotions. COVID-19 has introduced the world to no-touch intelligent systems. With an influx of users, it is critical to create devices that can communicate in a local dialect. A multilingual system is required in countries like India, which have a large population and a diverse range of languages. Given the importance of multilingual emotion recognition, this research introduces BERIS, an Indian-language emotion detection system. From the Indian sound recordings, BERIS estimates both acoustic and textual characteristics. To extract the textual features, we used multilingual Bidirectional Encoder Representations from Transformers (mBERT). For acoustics, BERIS computes the Mel-frequency cepstral coefficients (MFCC), linear prediction coefficients (LPC), and pitch. The extracted features are merged into a linear array. Since the dialogues are of varied lengths, the data are normalized to produce arrays of equal length. Finally, we split the data into training and validation sets to construct a predictive model that can predict emotions from new input. On all the datasets presented, quantitative and qualitative evaluations show that the proposed algorithm outperforms state-of-the-art approaches.
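A rough sketch of the feature-merging step described above: mBERT text embeddings concatenated with MFCC, LPC, and pitch statistics into one linear array. The model name, feature sizes, pooling choices, and the example utterance are assumptions for illustration, not details taken from BERIS:

```python
import numpy as np
import librosa
import torch
from transformers import AutoTokenizer, AutoModel

# Text branch: sentence embedding from multilingual BERT (mean-pooled last hidden state).
tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
bert = AutoModel.from_pretrained("bert-base-multilingual-cased")

def text_embedding(utterance: str) -> np.ndarray:
    inputs = tok(utterance, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = bert(**inputs).last_hidden_state          # (1, tokens, 768)
    return hidden.mean(dim=1).squeeze(0).numpy()           # (768,)

# Acoustic branch: MFCC, LPC, and pitch statistics from the waveform.
def acoustic_features(y: np.ndarray, sr: int = 16000) -> np.ndarray:
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1)   # (13,)
    lpc = librosa.lpc(y, order=12)[1:]                                # (12,) drop leading 1.0
    f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr)                     # frame-wise pitch track
    return np.concatenate([mfcc, lpc, [np.nanmean(f0)]])

# Merge both branches into one linear feature array.
y = np.random.randn(16000).astype(np.float32)              # stand-in for an utterance waveform
features = np.concatenate([text_embedding("mujhe bahut khushi hai"),  # example transliterated utterance
                           acoustic_features(y)])
print(features.shape)   # (768 + 13 + 12 + 1,) = (794,)
```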
|
57
|
Pandey A, Wang D. Self-attending RNN for Speech Enhancement to Improve Cross-corpus Generalization. IEEE/ACM Transactions on Audio, Speech, and Language Processing 2022; 30:1374-1385. [PMID: 36245814] [PMCID: PMC9560045] [DOI: 10.1109/taslp.2022.3161143]
Abstract
Deep neural networks (DNNs) represent the mainstream methodology for supervised speech enhancement, primarily due to their capability to model complex functions using hierarchical representations. However, a recent study revealed that DNNs trained on a single corpus fail to generalize to untrained corpora, especially in low signal-to-noise ratio (SNR) conditions. Developing a noise-, speaker-, and corpus-independent speech enhancement algorithm is essential for real-world applications. In this study, we propose a self-attending recurrent neural network (SARNN) for time-domain speech enhancement to improve cross-corpus generalization. SARNN comprises recurrent neural networks (RNNs) augmented with self-attention blocks and feedforward blocks. We evaluate SARNN on different corpora with nonstationary noises in low SNR conditions. Experimental results demonstrate that SARNN substantially outperforms competitive approaches to time-domain speech enhancement, such as RNNs and dual-path SARNNs. Additionally, we report an important finding: two popular approaches to speech enhancement, complex spectral mapping and time-domain enhancement, obtain similar results for RNN and SARNN with large-scale training. We also provide a challenging subset of the test set used in this study for evaluating future algorithms and facilitating direct comparisons.
Affiliation(s)
- Ashutosh Pandey: Department of Computer Science and Engineering, The Ohio State University, Columbus, OH 43210 USA
- DeLiang Wang: Department of Computer Science and Engineering and the Center for Cognitive and Brain Sciences, The Ohio State University, Columbus, OH 43210 USA
|
58
|
Purushothaman A, Sreeram A, Kumar R, Ganapathy S. Dereverberation of autoregressive envelopes for far-field speech recognition. Comput Speech Lang 2022. [DOI: 10.1016/j.csl.2021.101277]
|
59
|
A context aware-based deep neural network approach for simultaneous speech denoising and dereverberation. Neural Comput Appl 2022. [DOI: 10.1007/s00521-022-06968-1]
|
60
|
Sound Source Separation Mechanisms of Different Deep Networks Explained from the Perspective of Auditory Perception. Applied Sciences (Basel) 2022. [DOI: 10.3390/app12020832]
Abstract
Thanks to the development of deep learning, various sound source separation networks have been proposed and have made significant progress. However, the study of the underlying separation mechanisms is still in its infancy. In this study, deep networks are explained from the perspective of auditory perception mechanisms. For separating two arbitrary sound sources from monaural recordings, three different networks with different parameters are trained and achieve excellent performance. The networks' outputs obtain an average scale-invariant signal-to-distortion ratio improvement (SI-SDRi) higher than 10 dB, comparable to human performance in separating natural sources. More importantly, the most intuitive principle, proximity, is explored through simultaneous and sequential organization experiments. Results show that, regardless of network structures and parameters, the proximity principle is learned spontaneously by all networks. If components are proximate in frequency or time, they are not easily separated by the networks. Moreover, the frequency resolution at low frequencies is better than at high frequencies. These behavioral characteristics of all three networks are highly consistent with those of the human auditory system, which implies that the learned proximity principle is not accidental, but the optimal strategy selected by both networks and humans when facing the same task. The emergence of auditory-like separation mechanisms suggests the possibility of developing a universal system that can be adapted to all sources and scenes.
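For reference, the SI-SDR improvement (SI-SDRi) reported above can be computed as follows; this is a generic NumPy sketch of the standard metric, not code from the study:

```python
import numpy as np

def si_sdr(estimate: np.ndarray, reference: np.ndarray, eps: float = 1e-8) -> float:
    """Scale-invariant signal-to-distortion ratio (dB) between an estimate and its reference."""
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference so any rescaling of the target is not penalized.
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    noise = estimate - target
    return 10 * np.log10((np.sum(target**2) + eps) / (np.sum(noise**2) + eps))

# SI-SDRi: score of the separated output minus score of the unprocessed mixture.
rng = np.random.default_rng(0)
ref = rng.standard_normal(16000)
mix = ref + 0.5 * rng.standard_normal(16000)
est = ref + 0.1 * rng.standard_normal(16000)      # stand-in for a network output
si_sdri = si_sdr(est, ref) - si_sdr(mix, ref)
print(f"SI-SDRi = {si_sdri:.1f} dB")
```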
|
61
|
Routray S, Mao Q. Phase sensitive masking-based single channel speech enhancement using conditional generative adversarial network. Comput Speech Lang 2022. [DOI: 10.1016/j.csl.2021.101270]
|
62
|
Wang H, Wang D. Neural Cascade Architecture with Triple-domain Loss for Speech Enhancement. IEEE/ACM Transactions on Audio, Speech, and Language Processing 2021; 30:734-743. [PMID: 36161036] [PMCID: PMC9491518] [DOI: 10.1109/taslp.2021.3138716]
Abstract
This paper proposes a neural cascade architecture to address the monaural speech enhancement problem. The cascade architecture is composed of three modules, which successively optimize the enhanced speech with respect to the magnitude spectrogram, the time-domain signal, and the complex spectrogram. Each module takes as input the noisy speech and the output of the previous module, and generates a prediction of its respective target. Our model is trained in an end-to-end manner, using a triple-domain loss function that accounts for the three domains of signal representation. Experimental results on the WSJ0 SI-84 corpus show that the proposed model outperforms other strong speech enhancement baselines in terms of objective speech quality and intelligibility.
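A simplified sketch of a triple-domain loss of the kind described, combining time-domain, magnitude-spectrogram, and complex-spectrogram terms; the equal weighting and STFT settings are assumptions, not the paper's values:

```python
import torch

def triple_domain_loss(est_wave: torch.Tensor, ref_wave: torch.Tensor,
                       n_fft: int = 512, hop: int = 128) -> torch.Tensor:
    """Combine time-domain, magnitude-spectrogram, and complex-spectrogram L1 losses."""
    window = torch.hann_window(n_fft, device=est_wave.device)
    est_spec = torch.stft(est_wave, n_fft, hop, window=window, return_complex=True)
    ref_spec = torch.stft(ref_wave, n_fft, hop, window=window, return_complex=True)
    time_loss = torch.mean(torch.abs(est_wave - ref_wave))
    mag_loss = torch.mean(torch.abs(est_spec.abs() - ref_spec.abs()))
    complex_loss = torch.mean(torch.abs(est_spec - ref_spec))
    return time_loss + mag_loss + complex_loss     # equal weights, for illustration only

est = torch.randn(2, 16000, requires_grad=True)    # stand-in for enhanced waveforms
ref = torch.randn(2, 16000)                        # stand-in for clean references
loss = triple_domain_loss(est, ref)
loss.backward()
print(float(loss))
```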
Affiliation(s)
- Heming Wang: Department of Computer Science and Engineering, The Ohio State University, OH 43210 USA
- DeLiang Wang: Department of Computer Science and Engineering and the Center for Cognitive and Brain Sciences, The Ohio State University, Columbus, OH 43210 USA
|
63
|
Sainburg T, Gentner TQ. Toward a Computational Neuroethology of Vocal Communication: From Bioacoustics to Neurophysiology, Emerging Tools and Future Directions. Front Behav Neurosci 2021; 15:811737. [PMID: 34987365] [PMCID: PMC8721140] [DOI: 10.3389/fnbeh.2021.811737]
Abstract
Recently developed methods in computational neuroethology have enabled increasingly detailed and comprehensive quantification of animal movements and behavioral kinematics. Vocal communication behavior is well poised for application of similar large-scale quantification methods in the service of physiological and ethological studies. This review describes emerging techniques that can be applied to acoustic and vocal communication signals with the goal of enabling study beyond a small number of model species. We review a range of modern computational methods for bioacoustics, signal processing, and brain-behavior mapping. Along with a discussion of recent advances and techniques, we include challenges and broader goals in establishing a framework for the computational neuroethology of vocal communication.
Affiliation(s)
- Tim Sainburg: Department of Psychology, University of California, San Diego, La Jolla, CA, United States; Center for Academic Research & Training in Anthropogeny, University of California, San Diego, La Jolla, CA, United States
- Timothy Q. Gentner: Department of Psychology, University of California, San Diego, La Jolla, CA, United States; Neurosciences Graduate Program, University of California, San Diego, La Jolla, CA, United States; Neurobiology Section, Division of Biological Sciences, University of California, San Diego, La Jolla, CA, United States; Kavli Institute for Brain and Mind, University of California, San Diego, La Jolla, CA, United States
|
64
|
Zheng Q, Yang M, Wang D, Tian X, Su H. An intelligent wireless communication model based on multi-feature fusion and quantile regression neural network. Journal of Intelligent & Fuzzy Systems 2021. [DOI: 10.3233/jifs-202430]
Abstract
Throughout the wireless communication network planning process, efficient estimation of signal reception power is of great significance for accurate 5G network deployment. A wireless propagation model predicts the radio wave propagation characteristics within the target communication coverage area, making it possible to estimate cell coverage, inter-cell network interference, and communication rates. In this paper, we develop a series of features by considering various factors in the signal transmission process, including the shadow coefficient, the absorption coefficients in the test area and the base station area, the distance attenuation coefficient, density, azimuth angle, relative height, and a ground feature index coefficient. We then design a quantile regression neural network that predicts reference signal receiving power (RSRP) from the above features. The network structure is specially constructed to generalize to various complex real environments. To demonstrate the effectiveness of the proposed features and the deep learning model, extensive comparative ablation experiments are conducted. Finally, we achieve a precision rate (PR), recall rate (RR), and inadequate coverage recognition rate (PCRR) of 84.3%, 78.4%, and 81.2%, respectively, on the public dataset. A comparison with a series of state-of-the-art machine learning methods illustrates the superiority of the proposed method.
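The quantile regression component can be illustrated with the standard pinball loss; the sketch below is generic (quantile levels and tensor shapes are assumptions, not the paper's configuration):

```python
import torch

def pinball_loss(pred: torch.Tensor, target: torch.Tensor, quantile: float) -> torch.Tensor:
    """Quantile (pinball) loss: asymmetric penalty that makes a network predict a chosen quantile."""
    diff = target - pred
    return torch.mean(torch.maximum(quantile * diff, (quantile - 1.0) * diff))

# Example: one output head per quantile of the RSRP distribution (quantile levels are illustrative).
quantiles = [0.1, 0.5, 0.9]
pred = torch.randn(32, len(quantiles), requires_grad=True)   # predicted RSRP quantiles
target = torch.randn(32, 1)                                  # measured RSRP
loss = sum(pinball_loss(pred[:, i], target[:, 0], q) for i, q in enumerate(quantiles))
loss.backward()
print(float(loss))
```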
Affiliation(s)
- Qinghe Zheng: School of Information Science and Engineering, Shandong University, Qingdao, China
- Mingqiang Yang: School of Information Science and Engineering, Shandong University, Qingdao, China
- Deqiang Wang: School of Information Science and Engineering, Shandong University, Qingdao, China
- Xinyu Tian: School of Intelligent Engineering, Shandong Management University, Jinan, China
- Huake Su: School of Microelectronics, Xidian University, Xian, China
|
65
|
Bermant PC. BioCPPNet: automatic bioacoustic source separation with deep neural networks. Sci Rep 2021; 11:23502. [PMID: 34873197] [PMCID: PMC8648737] [DOI: 10.1038/s41598-021-02790-2]
Abstract
We introduce the Bioacoustic Cocktail Party Problem Network (BioCPPNet), a lightweight, modular, and robust U-Net-based machine learning architecture optimized for bioacoustic source separation across diverse biological taxa. Employing learnable or handcrafted encoders, BioCPPNet operates directly on the raw acoustic mixture waveform containing overlapping vocalizations and separates the input waveform into estimates corresponding to the sources in the mixture. Predictions are compared to the reference ground truth waveforms by searching over the space of (output, target) source order permutations, and we train using an objective function motivated by perceptual audio quality. We apply BioCPPNet to several species with unique vocal behavior, including macaques, bottlenose dolphins, and Egyptian fruit bats, and we evaluate reconstruction quality of separated waveforms using the scale-invariant signal-to-distortion ratio (SI-SDR) and downstream identity classification accuracy. We consider mixtures with two or three concurrent conspecific vocalizers, and we examine separation performance in open and closed speaker scenarios. To our knowledge, this paper redefines the state-of-the-art in end-to-end single-channel bioacoustic source separation in a permutation-invariant regime across a heterogeneous set of non-human species. This study serves as a major step toward the deployment of bioacoustic source separation systems for processing substantial volumes of previously unusable data containing overlapping bioacoustic signals.
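The search over (output, target) source-order permutations mentioned above can be sketched as follows, here scored with SI-SDR for concreteness; this is an illustrative NumPy sketch, not BioCPPNet code (which trains with a perceptually motivated objective):

```python
from itertools import permutations
import numpy as np

def si_sdr(est, ref, eps=1e-8):
    """Scale-invariant SDR (dB) of one estimated waveform against one reference."""
    est, ref = est - est.mean(), ref - ref.mean()
    target = (np.dot(est, ref) / (np.dot(ref, ref) + eps)) * ref
    noise = est - target
    return 10 * np.log10((np.sum(target**2) + eps) / (np.sum(noise**2) + eps))

def best_permutation_si_sdr(estimates, references):
    """Try all (output, target) orderings and keep the one with the highest mean SI-SDR."""
    best = -np.inf
    for perm in permutations(range(len(references))):
        score = np.mean([si_sdr(estimates[i], references[p]) for i, p in enumerate(perm)])
        best = max(best, score)
    return best   # maximizing this (or minimizing its negative) gives a permutation-invariant objective

rng = np.random.default_rng(0)
refs = [rng.standard_normal(8000) for _ in range(2)]
# Outputs deliberately swapped relative to the references to show permutation invariance.
ests = [refs[1] + 0.1 * rng.standard_normal(8000), refs[0] + 0.1 * rng.standard_normal(8000)]
print(best_permutation_si_sdr(ests, refs))
```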
|
66
|
Tseng RY, Wang TW, Fu SW, Lee CY, Tsao Y. A Study of Joint Effect on Denoising Techniques and Visual Cues to Improve Speech Intelligibility in Cochlear Implant Simulation. IEEE Trans Cogn Dev Syst 2021. [DOI: 10.1109/tcds.2020.3017042]
|
67
|
Drgas S, Virtanen T. Joint speaker separation and recognition using non-negative matrix deconvolution with adaptive dictionary. Comput Speech Lang 2021. [DOI: 10.1016/j.csl.2021.101223]
|
68
|
Atkins A, Cohen I, Benesty J. Adaptive line enhancer for nonstationary harmonic noise reduction. Comput Speech Lang 2021. [DOI: 10.1016/j.csl.2021.101245]
|
69
|
Abstract
This study analyses the main challenges, trends, technological approaches, and artificial intelligence methods developed by new researchers and professionals in the field of machine learning, with an emphasis on the most outstanding and relevant works to date. This literature review evaluates the main methodological contributions of artificial intelligence through machine learning. The methodology used to study the documents was content analysis; the basic terminology of the study corresponds to machine learning, artificial intelligence, and big data between the years 2017 and 2021. For this study, we selected 181 references, of which 120 are part of the literature review. The conceptual framework includes 12 categories, four groups, and eight subgroups. The study of data management using AI methodologies presents symmetry in the four machine learning groups: supervised learning, unsupervised learning, semi-supervised learning, and reinforced learning. Furthermore, the artificial intelligence methods with more symmetry in all groups are artificial neural networks, Support Vector Machines, K-means, and Bayesian Methods. Finally, five research avenues are presented to improve the prediction of machine learning.
|
70
|
Li LPH, Han JY, Zheng WZ, Huang RJ, Lai YH. Improved Environment-Aware-Based Noise Reduction System for Cochlear Implant Users Based on a Knowledge Transfer Approach: Development and Usability Study. J Med Internet Res 2021; 23:e25460. [PMID: 34709193] [PMCID: PMC8587190] [DOI: 10.2196/25460]
Abstract
BACKGROUND: Cochlear implant technology is a well-known approach to help deaf individuals hear speech again and can improve speech intelligibility in quiet conditions; however, it still has room for improvement in noisy conditions. More recently, it has been proven that deep learning-based noise reduction, such as noise classification and deep denoising autoencoder (NC+DDAE), can benefit the intelligibility performance of patients with cochlear implants compared to classical noise reduction algorithms.
OBJECTIVE: Following the successful implementation of the NC+DDAE model in our previous study, this study aimed to propose an advanced noise reduction system using knowledge transfer technology, called NC+DDAE_T; examine the proposed NC+DDAE_T noise reduction system using objective evaluations and subjective listening tests; and investigate which layer substitution of the knowledge transfer technology in the NC+DDAE_T noise reduction system provides the best outcome.
METHODS: The knowledge transfer technology was adopted to reduce the number of parameters of the NC+DDAE_T compared with the NC+DDAE. We investigated which layer should be substituted using short-time objective intelligibility and perceptual evaluation of speech quality scores, as well as t-distributed stochastic neighbor embedding to visualize the features in each model layer. Moreover, we enrolled 10 cochlear implant users for listening tests to evaluate the benefits of the newly developed NC+DDAE_T.
RESULTS: The experimental results showed that substituting the middle layer (i.e., the second layer in this study) of the noise-independent DDAE (NI-DDAE) model achieved the best performance gain regarding short-time objective intelligibility and perceptual evaluation of speech quality scores. Therefore, the parameters of layer 3 in the NI-DDAE were chosen to be replaced, thereby establishing the NC+DDAE_T. Both objective and listening test results showed that the proposed NC+DDAE_T noise reduction system achieved performance similar to that of the previous NC+DDAE in several noisy test conditions, while requiring only a quarter of the number of parameters.
CONCLUSIONS: This study demonstrated that knowledge transfer technology can help reduce the number of parameters in an NC+DDAE while maintaining similar performance. This suggests that the proposed NC+DDAE_T model may reduce the implementation costs of this noise reduction system and provide more benefits for cochlear implant users.
Affiliation(s)
- Lieber Po-Hung Li: Department of Otolaryngology, Cheng Hsin General Hospital, Taipei, Taiwan; Faculty of Medicine, Institute of Brain Science, National Yang Ming Chiao Tung University, Taipei, Taiwan; Department of Medical Research, China Medical University Hospital, China Medical University, Taichung, Taiwan; Department of Speech Language Pathology and Audiology, College of Health Technology, National Taipei University of Nursing and Health Sciences, Taipei, Taiwan
- Ji-Yan Han: Department of Biomedical Engineering, National Yang Ming Chiao Tung University, Taipei, Taiwan
- Wei-Zhong Zheng: Department of Biomedical Engineering, National Yang Ming Chiao Tung University, Taipei, Taiwan
- Ren-Jie Huang: Department of Biomedical Engineering, National Yang Ming Chiao Tung University, Taipei, Taiwan
- Ying-Hui Lai: Department of Biomedical Engineering, National Yang Ming Chiao Tung University, Taipei, Taiwan
|
71
|
Harnessing the power of artificial intelligence to transform hearing healthcare and research. Nat Mach Intell 2021. [DOI: 10.1038/s42256-021-00394-z]
|
72
|
Lv Z, Li J, Dong C, Li H, Xu Z. Deep learning in the COVID-19 epidemic: A deep model for urban traffic revitalization index. Data Knowl Eng 2021; 135:101912. [PMID: 34602688] [PMCID: PMC8473779] [DOI: 10.1016/j.datak.2021.101912]
Abstract
Research on the traffic revitalization index can provide support for the formulation and adjustment of policies related to urban management, epidemic prevention, and the resumption of work and production. This paper proposes a deep model for the prediction of the urban Traffic Revitalization Index (DeepTRI). DeepTRI models COVID-19 epidemic data and the traffic revitalization index for major cities in China. The location information of 29 cities forms the topological structure of a graph. The Spatial Convolution Layer proposed in this paper captures the spatial correlation features of the graph structure, and a special Graph Data Fusion module distributes and fuses the two kinds of data in different proportions to strengthen the spatial correlation of the data. To reduce computational complexity, the Temporal Convolution Layer replaces the gated recursive mechanism of the traditional recurrent neural network with a multi-level residual structure. It uses dilated convolution, whose dilation factor changes according to a convex function, to control the dynamic change of the receptive field, and uses causal convolution to fully mine the historical information of the data and optimize long-term prediction. Comparative experiments between DeepTRI and three baselines (a traditional recurrent neural network, an ordinary spatial-temporal model, and a graph spatial-temporal model) show the advantages of DeepTRI in the evaluation indices and in resolving two under-fitting problems (under-fitting of edge values and under-fitting of local peaks).
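The dilated causal convolution used in the Temporal Convolution Layer can be sketched as follows; the kernel size, channel count, and dilation schedule are illustrative assumptions:

```python
import torch
import torch.nn as nn

class CausalDilatedConv(nn.Module):
    """1-D convolution that is causal (no future leakage) with a configurable dilation factor."""
    def __init__(self, channels: int, kernel_size: int = 3, dilation: int = 1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation            # pad only on the left (past)
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                                  # x: (batch, channels, time)
        return self.conv(nn.functional.pad(x, (self.pad, 0)))

# Stack blocks whose dilation grows (here 1, 2, 4, 8) so the receptive field expands over history.
layers = nn.Sequential(*[CausalDilatedConv(8, dilation=d) for d in (1, 2, 4, 8)])
print(layers(torch.randn(2, 8, 64)).shape)   # torch.Size([2, 8, 64]), time length preserved
```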
Affiliation(s)
- Zhiqiang Lv: College of Computer Science & Technology, Qingdao University, Qingdao 266071, China
- Jianbo Li: College of Computer Science & Technology, Qingdao University, Qingdao 266071, China
- Chuanhao Dong: Institute of Ubiquitous Networks and Urban Computing, Qingdao 266070, China
- Haoran Li: Institute of Ubiquitous Networks and Urban Computing, Qingdao 266070, China
- Zhihao Xu: Institute of Ubiquitous Networks and Urban Computing, Qingdao 266070, China
|
73
|
Abdi M, Feng X, Sun C, Bilchick KC, Meyer CH, Epstein FH. Suppression of artifact-generating echoes in cine DENSE using deep learning. Magn Reson Med 2021; 86:2095-2104. [PMID: 34021628] [PMCID: PMC8295221] [DOI: 10.1002/mrm.28832]
Abstract
PURPOSE: To use deep learning for suppression of the artifact-generating T1-relaxation echo in cine displacement encoding with stimulated echoes (DENSE), for the purpose of reducing the scan time.
METHODS: A U-Net was trained to suppress the artifact-generating T1-relaxation echo using complementary phase-cycled data as the ground truth. A data-augmentation method was developed that generates synthetic DENSE images with arbitrary displacement-encoding frequencies, so that the T1-relaxation echo can be suppressed over a range of modulation frequencies. The resulting U-Net (DAS-Net) was compared with k-space zero-filling as an alternative method. Non-phase-cycled DENSE images acquired in shorter breath-holds were processed by DAS-Net and compared with DENSE images acquired with phase cycling for the quantification of myocardial strain.
RESULTS: The DAS-Net method effectively suppressed the T1-relaxation echo and its artifacts, achieving a root-mean-square (RMS) error of 5.5 ± 0.8 and a structural similarity index of 0.85 ± 0.02 for DENSE images acquired with a displacement-encoding frequency of 0.10 cycles/mm. DAS-Net outperformed zero-filling (RMS error = 5.8 ± 1.5 vs 13.5 ± 1.5, DAS-Net vs zero-filling, P < .01; structural similarity index = 0.83 ± 0.04 vs 0.66 ± 0.03, DAS-Net vs zero-filling, P < .01). Strain data for non-phase-cycled DENSE images with DAS-Net showed close agreement with strain from phase-cycled DENSE.
CONCLUSION: The DAS-Net method provides an effective alternative approach for suppression of the artifact-generating T1-relaxation echo in DENSE MRI, enabling a 42% reduction in scan time compared to DENSE with phase cycling.
Affiliation(s)
- Mohamad Abdi: Department of Biomedical Engineering, University of Virginia, Charlottesville, Virginia
- Xue Feng: Department of Biomedical Engineering, University of Virginia, Charlottesville, Virginia
- Changyu Sun: Department of Biomedical Engineering, University of Virginia, Charlottesville, Virginia
- Kenneth C. Bilchick: Department of Medicine, University of Virginia Health System, Charlottesville, Virginia
- Craig H. Meyer: Department of Biomedical Engineering, University of Virginia, Charlottesville, Virginia; Department of Radiology, University of Virginia Health System, Charlottesville, Virginia
- Frederick H. Epstein: Department of Biomedical Engineering, University of Virginia, Charlottesville, Virginia; Department of Radiology, University of Virginia Health System, Charlottesville, Virginia
|
74
|
Viswanathan V, Bharadwaj HM, Shinn-Cunningham BG, Heinz MG. Modulation masking and fine structure shape neural envelope coding to predict speech intelligibility across diverse listening conditions. The Journal of the Acoustical Society of America 2021; 150:2230. [PMID: 34598642] [PMCID: PMC8483789] [DOI: 10.1121/10.0006385]
Abstract
A fundamental question in the neuroscience of everyday communication is how scene acoustics shape the neural processing of attended speech sounds and in turn impact speech intelligibility. While it is well known that the temporal envelopes in target speech are important for intelligibility, how the neural encoding of target-speech envelopes is influenced by background sounds or other acoustic features of the scene is unknown. Here, we combine human electroencephalography with simultaneous intelligibility measurements to address this key gap. We find that the neural envelope-domain signal-to-noise ratio in target-speech encoding, which is shaped by masker modulations, predicts intelligibility over a range of strategically chosen realistic listening conditions unseen by the predictive model. This provides neurophysiological evidence for modulation masking. Moreover, using high-resolution vocoding to carefully control peripheral envelopes, we show that target-envelope coding fidelity in the brain depends not only on envelopes conveyed by the cochlea, but also on the temporal fine structure (TFS), which supports scene segregation. Our results are consistent with the notion that temporal coherence of sound elements across envelopes and/or TFS influences scene analysis and attentive selection of a target sound. Our findings also inform speech-intelligibility models and technologies attempting to improve real-world speech communication.
Affiliation(s)
- Vibha Viswanathan: Weldon School of Biomedical Engineering, Purdue University, West Lafayette, Indiana 47907, USA
- Hari M Bharadwaj: Department of Speech, Language, and Hearing Sciences, Purdue University, West Lafayette, Indiana 47907, USA
- Michael G Heinz: Department of Speech, Language, and Hearing Sciences, Purdue University, West Lafayette, Indiana 47907, USA
|
75
|
Kuruvila I, Muncke J, Fischer E, Hoppe U. Extracting the Auditory Attention in a Dual-Speaker Scenario From EEG Using a Joint CNN-LSTM Model. Front Physiol 2021; 12:700655. [PMID: 34408661] [PMCID: PMC8365753] [DOI: 10.3389/fphys.2021.700655]
Abstract
The human brain performs remarkably well in segregating a particular speaker from interfering ones in a multispeaker scenario. We can quantitatively evaluate this segregation capability by modeling the relationship between the speech signals present in an auditory scene and the listener's cortical signals measured using electroencephalography (EEG). This has opened up avenues to integrate neuro-feedback into hearing aids, where the device can infer the user's attention and enhance the attended speaker. Commonly used algorithms to infer auditory attention are based on linear systems theory, where cues such as speech envelopes are mapped onto the EEG signals. Here, we present a joint convolutional neural network (CNN) and long short-term memory (LSTM) model to infer auditory attention. Our joint CNN-LSTM model takes the EEG signals and the spectrograms of the multiple speakers as inputs and classifies the attention to one of the speakers. We evaluated the reliability of our network using three different datasets comprising 61 subjects, where each subject undertook a dual-speaker experiment. The three datasets analyzed corresponded to speech stimuli presented in three different languages, namely German, Danish, and Dutch. Using the proposed joint CNN-LSTM model, we obtained a median decoding accuracy of 77.2% at a trial duration of 3 s. Furthermore, we evaluated the amount of sparsity that the model can tolerate by means of magnitude pruning and found a tolerance of up to 50% sparsity without substantial loss of decoding accuracy.
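A toy sketch of a joint CNN-LSTM classifier that takes EEG and the two speakers' spectrograms and outputs which speaker is attended; all layer sizes and the input packing are assumptions for illustration, not the published architecture:

```python
import torch
import torch.nn as nn

class CNNLSTMAAD(nn.Module):
    """Toy joint CNN-LSTM auditory attention decoder: EEG + two spectrograms -> attended speaker."""
    def __init__(self, eeg_channels: int = 64, n_mels: int = 64):
        super().__init__()
        self.cnn = nn.Sequential(                       # shared convolutional front end
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((None, 8)),            # keep the time axis, squeeze the feature axis
        )
        self.lstm = nn.LSTM(input_size=32 * 8, hidden_size=64, batch_first=True)
        self.head = nn.Linear(64, 2)                    # logits over the two speakers

    def forward(self, eeg, spec1, spec2):
        # eeg: (B, T, eeg_channels); spec1/spec2: (B, T, n_mels)
        x = torch.cat([eeg, spec1, spec2], dim=-1).unsqueeze(1)   # (B, 1, T, C)
        feat = self.cnn(x)                                        # (B, 32, T', 8)
        feat = feat.permute(0, 2, 1, 3).flatten(2)                # (B, T', 256)
        out, _ = self.lstm(feat)
        return self.head(out[:, -1])                              # classify from the last time step

model = CNNLSTMAAD()
logits = model(torch.randn(4, 192, 64), torch.randn(4, 192, 64), torch.randn(4, 192, 64))
print(logits.shape)   # torch.Size([4, 2])
```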
Affiliation(s)
- Ivine Kuruvila: Department of Audiology, ENT-Clinic, Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), Erlangen, Germany
- Jan Muncke: Department of Audiology, ENT-Clinic, Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), Erlangen, Germany
- Ulrich Hoppe: Department of Audiology, ENT-Clinic, Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), Erlangen, Germany
|
76
|
Geravanchizadeh M, Zakeri S. Ear-EEG-based binaural speech enhancement (ee-BSE) using auditory attention detection and audiometric characteristics of hearing-impaired subjects. J Neural Eng 2021; 18. [PMID: 34289464] [DOI: 10.1088/1741-2552/ac16b4]
Abstract
Objective: Speech perception in cocktail party scenarios has been the concern of researchers involved in the design of hearing-aid devices.
Approach: In this paper, a new unified ear-EEG-based binaural speech enhancement system is introduced for hearing-impaired (HI) listeners. The proposed model, which is based on auditory attention detection (AAD) and individual hearing threshold (HT) characteristics, has four main processing stages. In the binaural processing stage, a system based on a deep neural network is trained to estimate auditory ratio masks for each of the speakers in the mixture signal. In the EEG processing stage, AAD is employed to select the ratio mask corresponding to the attended speech. Here, the same EEG data are also used to predict the HTs of the listeners who participated in the EEG recordings. The third stage, called insertion gain computation, concerns the calculation of a special amplification gain based on individual HTs. Finally, in the selection-resynthesis-amplification stage, the attended target speech signals are resynthesized based on the selected auditory mask and then amplified using the computed insertion gain.
Main results: The detection of the attended speech and the HTs is achieved by classifiers trained with features extracted from scalp-EEG or ear-EEG signals. The results of evaluating AAD and HT detection show high detection accuracies. Systematic evaluations of the proposed system yield substantial intelligibility and quality improvements for the HI and normal-hearing audiograms.
Significance: The AAD method determines the direction of attention from single-trial EEG signals without access to the audio signals of the speakers. The amplification procedure can be adjusted for each subject based on the individual HTs. The present model has the potential to serve as an important processing tool for personalizing neuro-steered hearing aids.
Affiliation(s)
- Masoud Geravanchizadeh: Faculty of Electrical and Computer Engineering, University of Tabriz, Tabriz 51666-15813, Iran
- Sahar Zakeri: Faculty of Electrical and Computer Engineering, University of Tabriz, Tabriz 51666-15813, Iran
|
77
|
Multilayered convolutional neural network-based auto-CODEC for audio signal denoising using mel-frequency cepstral coefficients. Neural Comput Appl 2021. [DOI: 10.1007/s00521-021-05782-5]
|
78
|
Cheng L, Peng R, Li A, Zheng C, Li X. Deep learning-based stereophonic acoustic echo suppression without decorrelation. The Journal of the Acoustical Society of America 2021; 150:816. [PMID: 34470328] [DOI: 10.1121/10.0005757]
Abstract
Traditional stereophonic acoustic echo cancellation algorithms need to estimate acoustic echo paths from stereo loudspeakers to a microphone, which often suffers from the nonuniqueness problem caused by a high correlation between the two far-end signals of these stereo loudspeakers. Many decorrelation methods have already been proposed to mitigate this problem. However, these methods may reduce the audio quality and/or stereophonic spatial perception. This paper proposes to use a convolutional recurrent network (CRN) to suppress the stereophonic echo components by estimating a nonlinear gain, which is then multiplied by the complex spectrum of the microphone signal to obtain the estimated near-end speech without a decorrelation procedure. The CRN includes an encoder-decoder module and two-layer gated recurrent network module, which can take advantage of the feature extraction capability of the convolutional neural networks and temporal modeling capability of recurrent neural networks simultaneously. The magnitude spectra of the two far-end signals are used as input features directly without any decorrelation preprocessing and, thus, both the audio quality and stereophonic spatial perception can be maintained. The experimental results in both the simulated and real acoustic environments show that the proposed algorithm outperforms traditional algorithms such as the normalized least-mean square and Wiener algorithms, especially in situations of low signal-to-echo ratio and high reverberation time RT60.
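The core operation described above, multiplying a network-estimated nonlinear gain with the microphone's complex spectrum and resynthesizing the near-end speech, can be sketched as follows; the gain here is a random stand-in for the CRN output, and the STFT settings are assumptions:

```python
import torch

def apply_gain(mic_complex_spec: torch.Tensor, gain: torch.Tensor) -> torch.Tensor:
    """Suppress echo by scaling the microphone's complex spectrum with a real-valued gain in [0, 1]."""
    return mic_complex_spec * gain      # element-wise over (batch, freq, frames)

# Toy shapes: in the paper the gain comes from a CRN fed with the mic and far-end magnitude spectra.
mic = torch.view_as_complex(torch.randn(2, 257, 100, 2))   # stand-in microphone STFT
gain = torch.sigmoid(torch.randn(2, 257, 100))             # stand-in for the network output
near_end_estimate = apply_gain(mic, gain)
wave = torch.istft(near_end_estimate, n_fft=512, hop_length=256,
                   window=torch.hann_window(512))           # resynthesize the estimated near-end speech
print(wave.shape)
```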
Affiliation(s)
- Linjuan Cheng: Key Laboratory of Noise and Vibration Research, Institute of Acoustics, Chinese Academy of Sciences, Beijing, 100190, China
- Renhua Peng: Key Laboratory of Noise and Vibration Research, Institute of Acoustics, Chinese Academy of Sciences, Beijing, 100190, China
- Andong Li: Key Laboratory of Noise and Vibration Research, Institute of Acoustics, Chinese Academy of Sciences, Beijing, 100190, China
- Chengshi Zheng: Key Laboratory of Noise and Vibration Research, Institute of Acoustics, Chinese Academy of Sciences, Beijing, 100190, China
- Xiaodong Li: Key Laboratory of Noise and Vibration Research, Institute of Acoustics, Chinese Academy of Sciences, Beijing, 100190, China
|
79
|
Behavioral Pattern Analysis between Bilingual and Monolingual Listeners’ Natural Speech Perception on Foreign-Accented English Language Using Different Machine Learning Approaches. Technologies 2021. [DOI: 10.3390/technologies9030051]
Abstract
Speech perception in an adverse background/noisy environment is a complex and challenging human process, which is made even more complicated for bilingual and monolingual individuals when the speech is foreign-accented. Listeners who have difficulties in hearing are affected most by such a situation. Despite considerable efforts, increasing speech intelligibility in noise remains elusive. Considering this opportunity, this study investigates the behavioral patterns of Bengali–English bilinguals and native American English monolinguals on foreign-accented English under bubble noise, Gaussian (white) noise, and quiet conditions. Twelve participants with regular hearing (six Bengali–English bilinguals and six native American English monolinguals) took part in this study. Statistical computation shows that speech with different noise types has a significant effect (p = 0.009) on listening for both bilingual and monolingual listeners under different sound levels (e.g., 55 dB, 65 dB, and 75 dB). Six different machine learning approaches (logistic regression (LR), linear discriminant analysis (LDA), K-nearest neighbors (KNN), naive Bayes (NB), classification and regression trees (CART), and support vector machines (SVM)) are tested and evaluated to differentiate between bilingual and monolingual individuals from their behavioral patterns in both noisy and quiet environments. Results show that the best performance was observed using LDA, which successfully differentiated between bilingual and monolingual individuals 60% of the time. A deep neural network-based model is proposed to improve this measure further and achieves an accuracy of nearly 100% in successfully differentiating between bilingual and monolingual individuals.
|
80
|
Zhao M, Yao X, Wang J, Yan Y, Gao X, Fan Y. Single-Channel Blind Source Separation of Spatial Aliasing Signal Based on Stacked-LSTM. Sensors (Basel) 2021; 21:4844. [PMID: 34300584] [PMCID: PMC8309757] [DOI: 10.3390/s21144844]
Abstract
Aiming at the problem of insufficient separation accuracy for aliased signals in space Internet satellite-ground communication scenarios, a stacked long short-term memory network (Stacked-LSTM) separation method based on deep learning is proposed. First, the encoded feature representation of the mixed signal is extracted. Then, the long input sequence is divided into smaller blocks by the Stacked-LSTM network with the attention mechanism of the SE module, and deep feature masks of the source signals are trained; the Hadamard product of each source's mask and the encoded features of the mixed signal yields the encoded feature representation of that source. Finally, the source signal features are decoded by 1-D convolution to obtain the original waveforms. The negative scale-invariant source-to-noise ratio (SISNR) is used as the loss function for network training, i.e., the evaluation index of single-channel blind source separation performance. The results show that, in the single-channel separation of spatially aliased signals, the Stacked-LSTM method improves SISNR by 10.09 to 38.17 dB compared with two classic separation algorithms, ICA and NMF, and three deep learning separation methods, TasNet, Conv-TasNet and Wave-U-Net. The Stacked-LSTM method thus has better separation accuracy and noise robustness.
Affiliation(s)
- Mengchen Zhao: National Space Science Center, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100190, China
- Xiujuan Yao: National Space Science Center, Chinese Academy of Sciences, Beijing 100190, China
- Jing Wang: National Space Science Center, Chinese Academy of Sciences, Beijing 100190, China
- Yi Yan: National Space Science Center, Chinese Academy of Sciences, Beijing 100190, China
- Xiang Gao: National Space Science Center, Chinese Academy of Sciences, Beijing 100190, China
- Yanan Fan: National Space Science Center, Chinese Academy of Sciences, Beijing 100190, China
|
81
|
82
|
Potential of Augmented Reality Platforms to Improve Individual Hearing Aids and to Support More Ecologically Valid Research. Ear Hear 2021; 41 Suppl 1:140S-146S. [PMID: 33105268] [PMCID: PMC7676615] [DOI: 10.1097/aud.0000000000000961]
Abstract
An augmented reality (AR) platform combines several technologies in a system that can render individual “digital objects” that can be manipulated for a given purpose. In the audio domain, these may, for example, be generated by speaker separation, noise suppression, and signal enhancement. Access to the “digital objects” could be used to augment auditory objects that the user wants to hear better. Such AR platforms in conjunction with traditional hearing aids may contribute to closing the gap for people with hearing loss through multimodal sensor integration, leveraging extensive current artificial intelligence research, and machine-learning frameworks. This could take the form of an attention-driven signal enhancement and noise suppression platform, together with context awareness, which would improve the interpersonal communication experience in complex real-life situations. In that sense, an AR platform could serve as a frontend to current and future hearing solutions. The AR device would enhance the signals to be attended, but the hearing amplification would still be handled by hearing aids. In this article, suggestions are made about why AR platforms may offer ideal affordances to compensate for hearing loss, and how research-focused AR platforms could help toward better understanding of the role of hearing in everyday life.
|
83
|
Chen H, Du J, Hu Y, Dai LR, Yin BC, Lee CH. Correlating subword articulation with lip shapes for embedding aware audio-visual speech enhancement. Neural Netw 2021; 143:171-182. [PMID: 34157642] [DOI: 10.1016/j.neunet.2021.06.003]
Abstract
In this paper, we propose a visual embedding approach to improve embedding-aware speech enhancement (EASE) by synchronizing visual lip frames at the phone and place-of-articulation levels. We first extract visual embedding from lip frames using a pre-trained phone or articulation place recognizer for visual-only EASE (VEASE). Next, we extract audio-visual embedding from noisy speech and lip frames in an information intersection manner, utilizing the complementarity of audio and visual features for multi-modal EASE (MEASE). Experiments on the TCD-TIMIT corpus corrupted by simulated additive noises show that our proposed subword-based VEASE approach is more effective than conventional embedding at the word level. Moreover, visual embedding at the articulation place level, leveraging the high correlation between place of articulation and lip shapes, demonstrates even better performance than that at the phone level. Finally, the experiments establish that the proposed MEASE framework, incorporating both audio and visual embeddings, yields significantly better speech quality and intelligibility than the best visual-only and audio-only EASE systems.
Affiliation(s)
- Hang Chen: National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China, Hefei, Anhui, China
- Jun Du: National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China, Hefei, Anhui, China
- Yu Hu: National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China, Hefei, Anhui, China
- Li-Rong Dai: National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China, Hefei, Anhui, China
- Bao-Cai Yin: iFlytek Research, iFlytek Co., Ltd., Hefei, Anhui, China
- Chin-Hui Lee: School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA, USA
|
84
|
Song J, Abay R, Tyo JS, Alenin AS. Transcending conventional snapshot polarimeter performance via neuromorphically adaptive filters. Optics Express 2021; 29:17758-17774. [PMID: 34154052] [DOI: 10.1364/oe.426072]
Abstract
A channeled Stokes polarimeter that recovers polarimetric signatures across the scene from the modulation-induced channels is preferable for many polarimetric sensing applications. Conventional channeled systems that isolate the intended channels with low-pass filters are sensitive to channel crosstalk effects, and the filters must be optimized for the bandwidth profile of the scene of interest before being applied to each particular scene to be measured. Here, we introduce a machine-learning-based channel filtering framework for channeled polarimeters. The models are trained to adaptively predict anti-aliasing filters according to the distribution of the measured data. A conventional snapshot Stokes polarimeter is simulated to demonstrate our machine-learning-based channel filtering framework. Finally, we demonstrate the advantage of our filtering framework by comparing the reconstructed polarimetric images with those from the conventional image reconstruction procedure.
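A hedged sketch of the general idea of learning a channel-isolation filter from the measured data: a tiny network maps the Fourier magnitude of a channeled measurement to the width of a Gaussian low-pass filter, which then isolates the channel. The Gaussian parameterization, the network, and the sizes are illustrative assumptions, not the paper's model.

```python
# Hedged sketch: a tiny network predicts the width of a Gaussian low-pass filter
# used to isolate one channel of a channeled (modulated) measurement.  The
# parameterization and network are illustrative assumptions, not the paper's design.
import numpy as np
import torch
import torch.nn as nn

def gaussian_lowpass(shape, sigma):
    """Frequency-domain Gaussian low-pass filter centred at DC."""
    fy = np.fft.fftfreq(shape[0])[:, None]
    fx = np.fft.fftfreq(shape[1])[None, :]
    return np.exp(-(fx**2 + fy**2) / (2.0 * sigma**2))

class FilterPredictor(nn.Module):
    """Maps a Fourier-magnitude image to a positive filter width."""
    def __init__(self, in_pixels=64 * 64):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(in_pixels, 64),
                                 nn.ReLU(), nn.Linear(64, 1), nn.Softplus())

    def forward(self, fourier_mag):
        return self.net(fourier_mag) + 1e-3   # keep sigma > 0

if __name__ == "__main__":
    image = np.random.rand(64, 64)                       # stand-in channeled measurement
    spectrum = np.fft.fft2(image)
    mag = torch.tensor(np.abs(spectrum), dtype=torch.float32).unsqueeze(0)
    sigma = FilterPredictor()(mag).item()
    filtered = np.fft.ifft2(spectrum * gaussian_lowpass(image.shape, sigma)).real
    print(sigma, filtered.shape)
```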
Collapse
|
85
|
Wei Y, Zhang K, Wu D, Hu Z. Exploring conventional enhancement and separation methods for multi‐speech enhancement in indoor environments. COGNITIVE COMPUTATION AND SYSTEMS 2021. [DOI: 10.1049/ccs2.12023] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/19/2022] Open
Affiliation(s)
- Yangjie Wei
- College of Computer Science and Engineering Northeastern University Shenyang China
| | - Ke Zhang
- College of Computer Science and Engineering Northeastern University Shenyang China
| | - Dan Wu
- College of Computer Science and Engineering Northeastern University Shenyang China
| | - Zhongqi Hu
- College of Computer Science and Engineering Northeastern University Shenyang China
| |
Collapse
|
86
|
Wang ZQ, Wang P, Wang D. Multi-microphone Complex Spectral Mapping for Utterance-wise and Continuous Speech Separation. IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING 2021; 29:2001-2014. [PMID: 34212067 PMCID: PMC8240467 DOI: 10.1109/taslp.2021.3083405] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/13/2023]
Abstract
We propose multi-microphone complex spectral mapping, a simple way of applying deep learning for time-varying non-linear beamforming, for speaker separation in reverberant conditions. We aim at both speaker separation and dereverberation. Our study first investigates offline utterance-wise speaker separation and then extends to block-online continuous speech separation (CSS). Assuming a fixed array geometry between training and testing, we train deep neural networks (DNN) to predict the real and imaginary (RI) components of target speech at a reference microphone from the RI components of multiple microphones. We then integrate multi-microphone complex spectral mapping with minimum variance distortionless response (MVDR) beamforming and post-filtering to further improve separation, and combine it with frame-level speaker counting for block-online CSS. Although our system is trained on simulated room impulse responses (RIR) based on a fixed number of microphones arranged in a given geometry, it generalizes well to a real array with the same geometry. State-of-the-art separation performance is obtained on the simulated two-talker SMS-WSJ corpus and the real-recorded LibriCSS dataset.
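A minimal sketch of the mapping the abstract describes, not the authors' network: the real and imaginary (RI) STFT components of all microphones are stacked along the channel axis, and a small convolutional network predicts the RI components of the target speech at the reference microphone. The layer depth, channel counts, and six-microphone setup are illustrative assumptions.

```python
# Minimal sketch (not the authors' network): map stacked real/imaginary (RI)
# spectrogram components of M microphones to the RI components of the target
# speech at a reference microphone.
import torch
import torch.nn as nn

class MultiMicComplexSpectralMapper(nn.Module):
    def __init__(self, num_mics=6, channels=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2 * num_mics, channels, kernel_size=3, padding=1),
            nn.ELU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ELU(),
            nn.Conv2d(channels, 2, kernel_size=3, padding=1),  # predicted RI at reference mic
        )

    def forward(self, ri_stack):
        # ri_stack: (B, 2*M, T, F) -- real and imaginary parts of each microphone's STFT
        return self.net(ri_stack)

if __name__ == "__main__":
    mixtures = torch.randn(1, 12, 200, 257)   # 6 mics x (real, imag)
    est_ri = MultiMicComplexSpectralMapper()(mixtures)
    est_complex = torch.complex(est_ri[:, 0], est_ri[:, 1])
    print(est_complex.shape)                   # (1, 200, 257)
```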
Collapse
Affiliation(s)
- Zhong-Qiu Wang
- Department of Computer Science and Engineering, The Ohio State University, Columbus, OH 43210-1277 USA, while performing this work. He is now with Mitsubishi Electric Research Laboratories, Cambridge, MA 02139, USA
| | - Peidong Wang
- Department of Computer Science and Engineering, The Ohio State University, Columbus, OH 43210-1277 USA
| | - DeLiang Wang
- Department of Computer Science and Engineering & the Center for Cognitive and Brain Sciences, The Ohio State University, Columbus, OH 43210-1277 USA
| |
Collapse
|
87
|
Tan K, Wang D. Towards Model Compression for Deep Learning Based Speech Enhancement. IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING 2021; 29:1785-1794. [PMID: 34179220 PMCID: PMC8224477 DOI: 10.1109/taslp.2021.3082282] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/20/2023]
Abstract
The use of deep neural networks (DNNs) has dramatically elevated the performance of speech enhancement over the last decade. However, achieving strong enhancement performance typically requires a large DNN, which is both memory- and computation-intensive, making it difficult to deploy such speech enhancement systems on devices with limited hardware resources or in applications with strict latency requirements. In this study, we propose two compression pipelines to reduce the model size for DNN-based speech enhancement, which incorporate three different techniques: sparse regularization, iterative pruning, and clustering-based quantization. We systematically investigate these techniques and evaluate the proposed compression pipelines. Experimental results demonstrate that our approach reduces the sizes of four different models by large margins without significantly sacrificing their enhancement performance. In addition, we find that the proposed approach performs well on speaker separation, which further demonstrates its effectiveness for compressing speech separation models.
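An illustrative sketch of two of the ingredients the abstract names, magnitude pruning and clustering-based (k-means) weight quantization, applied to a toy layer. The 80% pruning ratio, the 16-entry codebook, and the single linear layer are arbitrary assumptions, not the paper's pipeline settings.

```python
# Illustrative sketch: magnitude pruning followed by clustering-based (k-means)
# weight quantization on a toy layer.  Thresholds and cluster counts are arbitrary.
import numpy as np
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune
from sklearn.cluster import KMeans

layer = nn.Linear(256, 256)

# 1) Unstructured magnitude pruning: zero out the 80% smallest weights.
prune.l1_unstructured(layer, name="weight", amount=0.8)
prune.remove(layer, "weight")                      # make the pruning permanent

# 2) Clustering-based quantization of the surviving weights.
w = layer.weight.detach().numpy().ravel()
nonzero = w[w != 0].reshape(-1, 1)
kmeans = KMeans(n_clusters=16, n_init=10, random_state=0).fit(nonzero)
codebook = kmeans.cluster_centers_.ravel()
w_quant = np.zeros_like(w)
w_quant[w != 0] = codebook[kmeans.predict(nonzero)]
layer.weight.data = torch.tensor(w_quant.reshape(layer.weight.shape),
                                 dtype=layer.weight.dtype)

sparsity = float((layer.weight == 0).float().mean())
print(f"sparsity={sparsity:.2f}, unique weight values={len(np.unique(w_quant))}")
```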
Collapse
Affiliation(s)
- Ke Tan
- Department of Computer Science and Engineering, The Ohio State University, Columbus, OH, 43210-1277 USA
| | - DeLiang Wang
- Department of Computer Science and Engineering and the Center for Cognitive and Brain Sciences, The Ohio State University, Columbus, OH 43210-1277, USA
| |
Collapse
|
88
|
Zhang H, Wang D. Deep ANC: A deep learning approach to active noise control. Neural Netw 2021; 141:1-10. [PMID: 33839375 DOI: 10.1016/j.neunet.2021.03.037] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/28/2020] [Revised: 03/24/2021] [Accepted: 03/26/2021] [Indexed: 11/17/2022]
Abstract
Traditional active noise control (ANC) methods are based on adaptive signal processing with the least mean square algorithm as the foundation. They are linear systems and do not perform satisfactorily in the presence of nonlinear distortions. In this paper, we formulate ANC as a supervised learning problem and propose a deep learning approach, called deep ANC, to address the nonlinear ANC problem. The main idea is to employ deep learning to encode the optimal control parameters corresponding to different noises and environments. A convolutional recurrent network (CRN) is trained to estimate the real and imaginary spectrograms of the canceling signal from the reference signal so that the corresponding anti-noise can eliminate or attenuate the primary noise in the ANC system. Large-scale multi-condition training is employed to achieve good generalization and robustness against a variety of noises. The deep ANC method can be trained to achieve active noise cancellation no matter whether the reference signal is noise or noisy speech. In addition, a delay-compensated strategy is introduced to solve the potential latency problem of ANC systems. Experimental results show that deep ANC is effective for wideband noise reduction and generalizes well to untrained noises. Moreover, the proposed method can achieve ANC within a quiet zone and is robust against variations in reference signals.
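A hedged sketch of the deep-ANC signal flow the abstract outlines: a recurrent network maps the reference-noise STFT to a canceling-signal STFT, the anti-noise is synthesized by inverse STFT, and the residual at the error microphone is formed after a secondary path. The GRU stand-in for the CRN, the STFT settings, and the toy FIR secondary path are all assumptions, not the paper's architecture or measured paths.

```python
# Hedged sketch of the deep-ANC signal flow: a recurrent network maps the
# reference-noise STFT to a canceling-signal STFT; the anti-noise is synthesized
# by iSTFT and passed through a toy secondary-path FIR filter.
import torch
import torch.nn as nn

N_FFT, HOP = 320, 160
window = torch.hann_window(N_FFT)

class CancelerNet(nn.Module):
    def __init__(self, n_freq=N_FFT // 2 + 1, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(2 * n_freq, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 2 * n_freq)    # real and imaginary parts

    def forward(self, ref_spec):                    # ref_spec: (B, F, T) complex
        x = torch.cat([ref_spec.real, ref_spec.imag], dim=1).transpose(1, 2)
        y = self.out(self.rnn(x)[0]).transpose(1, 2)
        re, im = y.chunk(2, dim=1)
        return torch.complex(re, im)                # canceling-signal STFT

def fir(x, taps):
    """Apply a causal FIR filter (toy secondary path) via conv1d."""
    pad = taps.numel() - 1
    return torch.nn.functional.conv1d(
        torch.nn.functional.pad(x.unsqueeze(1), (pad, 0)),
        taps.flip(0).view(1, 1, -1)).squeeze(1)

if __name__ == "__main__":
    reference = torch.randn(1, 16000)               # 1 s of reference noise
    spec = torch.stft(reference, N_FFT, HOP, window=window, return_complex=True)
    anti_spec = CancelerNet()(spec)
    anti_noise = torch.istft(anti_spec, N_FFT, HOP, window=window,
                             length=reference.shape[-1])
    secondary = fir(anti_noise, torch.tensor([0.6, 0.3, 0.1]))
    residual = reference + secondary                # signal at the error microphone
    print(residual.shape)
```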
Collapse
Affiliation(s)
- Hao Zhang
- Department of Computer Science and Engineering, Ohio State University, Columbus, OH 43210-1277, USA.
| | - DeLiang Wang
- Department of Computer Science and Engineering, Ohio State University, Columbus, OH 43210-1277, USA; Center for Cognitive and Brain Sciences, Ohio State University, Columbus, OH 43210-1277, USA.
| |
Collapse
|
89
|
Bai Z, Zhang XL. Speaker recognition based on deep learning: An overview. Neural Netw 2021; 140:65-99. [PMID: 33744714 DOI: 10.1016/j.neunet.2021.03.004] [Citation(s) in RCA: 28] [Impact Index Per Article: 7.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/30/2020] [Revised: 03/01/2021] [Accepted: 03/01/2021] [Indexed: 11/17/2022]
Abstract
Speaker recognition is the task of identifying persons from their voices. Recently, deep learning has dramatically revolutionized speaker recognition. However, there is a lack of comprehensive reviews on this exciting progress. In this paper, we review several major subtasks of speaker recognition, including speaker verification, identification, diarization, and robust speaker recognition, with a focus on deep-learning-based methods. Because the major advantage of deep learning over conventional methods is its representation ability, which can produce highly abstract embedding features from utterances, we first pay close attention to deep-learning-based speaker feature extraction, including the inputs, network structures, temporal pooling strategies, and objective functions, which are the fundamental components of many speaker recognition subtasks. Then, we give an overview of speaker diarization, with an emphasis on recent supervised, end-to-end, and online diarization. Finally, we survey robust speaker recognition from the perspectives of domain adaptation and speech enhancement, two major approaches to dealing with domain mismatch and noise problems. Popular and recently released corpora are listed at the end of the paper.
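A minimal sketch of the embedding pipeline this review surveys: a frame-level encoder, statistics (mean plus standard deviation) temporal pooling, and an utterance-level speaker embedding compared by cosine similarity. The layer sizes and the specific TDNN-like stack are illustrative assumptions rather than any particular published system.

```python
# Minimal sketch of a speaker-embedding extractor with statistics pooling.
import torch
import torch.nn as nn

class SpeakerEmbedder(nn.Module):
    def __init__(self, n_mels=40, hidden=256, emb_dim=192):
        super().__init__()
        self.frame_net = nn.Sequential(          # frame-level feature extractor (TDNN-like)
            nn.Conv1d(n_mels, hidden, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.embedding = nn.Linear(2 * hidden, emb_dim)

    def forward(self, feats):                    # feats: (B, n_mels, T)
        h = self.frame_net(feats)
        stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)   # statistics pooling
        return self.embedding(stats)             # (B, emb_dim) speaker embedding

if __name__ == "__main__":
    emb = SpeakerEmbedder()(torch.randn(4, 40, 300))
    # cosine similarity between two utterance embeddings, as used for verification
    score = torch.nn.functional.cosine_similarity(emb[0:1], emb[1:2])
    print(emb.shape, score.item())
```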
Collapse
Affiliation(s)
- Zhongxin Bai
- Center of Intelligent Acoustics and Immersive Communications (CIAIC) and the School of Marine Science and Technology, Northwestern Polytechnical University, Xi'an Shaanxi 710072, China.
| | - Xiao-Lei Zhang
- Center of Intelligent Acoustics and Immersive Communications (CIAIC) and the School of Marine Science and Technology, Northwestern Polytechnical University, Xi'an Shaanxi 710072, China.
| |
Collapse
|
90
|
Pandey A, Wang D. Dense CNN with Self-Attention for Time-Domain Speech Enhancement. IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING 2021; 29:1270-1279. [PMID: 33997107 PMCID: PMC8118093 DOI: 10.1109/taslp.2021.3064421] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/11/2023]
Abstract
Speech enhancement in the time domain has become increasingly popular in recent years, due to its capability to jointly enhance both the magnitude and the phase of speech. In this work, we propose a dense convolutional network (DCN) with self-attention for speech enhancement in the time domain. DCN is an encoder-decoder architecture with skip connections. Each layer in the encoder and the decoder comprises a dense block and an attention module. Dense blocks and attention modules help in feature extraction through a combination of feature reuse, increased network depth, and maximum context aggregation. Furthermore, we reveal previously unknown problems with a loss based on the spectral magnitude of enhanced speech. To alleviate these problems, we propose a novel loss based on the magnitudes of the enhanced speech and a predicted noise. Even though the proposed loss is based on magnitudes only, a constraint imposed by the noise prediction ensures that the loss enhances both magnitude and phase. Experimental results demonstrate that a DCN trained with the proposed loss substantially outperforms other state-of-the-art approaches to causal and non-causal speech enhancement.
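A hedged sketch of one plausible reading of the magnitude-plus-noise loss the abstract mentions: L1 distances between the STFT magnitudes of the enhanced speech and the clean speech, and between the magnitudes of a predicted noise (taken here as mixture minus enhanced output) and the true noise. The exact formulation is the paper's; the pairing below and the STFT settings are assumptions for illustration.

```python
# Hedged sketch: L1 magnitude loss on enhanced speech plus a noise term, where the
# "predicted noise" is taken as (mixture - enhanced).
import torch

N_FFT, HOP = 512, 128
window = torch.hann_window(N_FFT)

def mag(x):
    return torch.stft(x, N_FFT, HOP, window=window, return_complex=True).abs()

def magnitude_noise_loss(enhanced, clean, mixture):
    est_noise = mixture - enhanced          # noise implied by the enhanced waveform
    true_noise = mixture - clean
    return (mag(enhanced) - mag(clean)).abs().mean() + \
           (mag(est_noise) - mag(true_noise)).abs().mean()

if __name__ == "__main__":
    clean = torch.randn(2, 16000)
    noise = 0.3 * torch.randn(2, 16000)
    mixture = clean + noise
    enhanced = mixture * 0.9                 # stand-in for a network output
    print(magnitude_noise_loss(enhanced, clean, mixture).item())
```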
Collapse
Affiliation(s)
- Ashutosh Pandey
- Department of Computer Science and Engineering, The Ohio State University, Columbus, OH 43210 USA
| | - DeLiang Wang
- Department of Computer Science and Engineering and the Center for Cognitive and Brain Sciences, The Ohio State University, Columbus, OH 43210 USA
| |
Collapse
|
91
|
Liu H, Yuan P, Yang B, Yang G, Chen Y. Head‐related transfer function–reserved time‐frequency masking for robust binaural sound source localization. CAAI TRANSACTIONS ON INTELLIGENCE TECHNOLOGY 2021. [DOI: 10.1049/cit2.12010] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/20/2022] Open
Affiliation(s)
- Hong Liu
- Key Laboratory of Machine Perception Shenzhen Graduate School Peking University Shenzhen China
| | - Peipei Yuan
- Key Laboratory of Machine Perception Shenzhen Graduate School Peking University Shenzhen China
| | - Bing Yang
- Key Laboratory of Machine Perception Shenzhen Graduate School Peking University Shenzhen China
| | - Ge Yang
- School of Artificial Intelligence Chongqing University of Technology Chongqing China
| | - Yang Chen
- Yanka Kupala State University of Grodno Grodno Belarus
| |
Collapse
|
92
|
Gupta S, Patil AT, Purohit M, Parmar M, Patel M, Patil HA, Guido RC. Residual Neural Network precisely quantifies dysarthria severity-level based on short-duration speech segments. Neural Netw 2021; 139:105-117. [PMID: 33684609 DOI: 10.1016/j.neunet.2021.02.008] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/02/2020] [Revised: 01/24/2021] [Accepted: 02/08/2021] [Indexed: 10/22/2022]
Abstract
Recently, deep learning methodologies have gained significant attention for severity-based classification of dysarthric speech. Detecting dysarthria and quantifying its severity are of paramount importance in various real-life applications, such as assessing patients' progression in treatment, which includes adequate planning of their therapy, and improving speech-based interactive systems so that pathologically affected voices are handled automatically. Notably, current speech-powered tools often deal with short-duration speech segments and, consequently, are less efficient in dealing with impaired speech, even when using Convolutional Neural Networks (CNNs). Thus, detecting dysarthria severity level from short speech segments might help improve the performance and applicability of those systems. To achieve this goal, we propose a novel Residual Network (ResNet)-based technique that receives short-duration speech segments as input. Statistically meaningful objective analysis of our experiments, reported over the standard Universal Access corpus, exhibits average improvements of 21.35% and 22.48% over the baseline CNN in terms of classification accuracy and F1-score, respectively. For additional comparison, tests with Gaussian Mixture Models and Light CNNs were also performed. Overall, a classification accuracy of 98.90% and an F1-score of 98.00% were obtained with the proposed ResNet approach, confirming its efficacy and practical applicability.
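A minimal sketch of a residual classifier over short spectrogram segments, in the spirit of (but not reproducing) the ResNet described in the abstract: identity-shortcut blocks followed by global pooling and a severity head. The four-class output, channel counts, and segment size are illustrative assumptions.

```python
# Minimal sketch (not the authors' ResNet): residual blocks over short
# spectrogram segments, global pooling, and a severity classifier.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch),
        )
    def forward(self, x):
        return torch.relu(x + self.body(x))     # identity shortcut

class SeverityClassifier(nn.Module):
    def __init__(self, n_classes=4, ch=16):
        super().__init__()
        self.stem = nn.Conv2d(1, ch, 3, padding=1)
        self.blocks = nn.Sequential(ResidualBlock(ch), ResidualBlock(ch))
        self.head = nn.Linear(ch, n_classes)
    def forward(self, x):                        # x: (B, 1, n_mels, frames)
        h = self.blocks(self.stem(x))
        return self.head(h.mean(dim=(2, 3)))     # global average pooling

if __name__ == "__main__":
    logits = SeverityClassifier()(torch.randn(8, 1, 40, 25))  # short segments
    print(logits.shape)   # torch.Size([8, 4])
```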
Collapse
Affiliation(s)
- Siddhant Gupta
- Speech Research Lab, Dhirubhai Ambani Institute of Information and Communication Technology (DA-IICT), Gandhinagar 382007, India
| | - Ankur T Patil
- Speech Research Lab, Dhirubhai Ambani Institute of Information and Communication Technology (DA-IICT), Gandhinagar 382007, India
| | - Mirali Purohit
- Speech Research Lab, Dhirubhai Ambani Institute of Information and Communication Technology (DA-IICT), Gandhinagar 382007, India
| | | | - Maitreya Patel
- Speech Research Lab, Dhirubhai Ambani Institute of Information and Communication Technology (DA-IICT), Gandhinagar 382007, India
| | - Hemant A Patil
- Speech Research Lab, Dhirubhai Ambani Institute of Information and Communication Technology (DA-IICT), Gandhinagar 382007, India
| | - Rodrigo Capobianco Guido
- Instituto de Biociências, Letras e Ciências Exatas, Unesp - Univ Estadual Paulista (São Paulo State University), Rua Cristóvão Colombo 2265, Jd Nazareth, 15054-000, São José do Rio Preto - SP, Brazil.
| |
Collapse
|
93
|
Few-shot pulse wave contour classification based on multi-scale feature extraction. Sci Rep 2021; 11:3762. [PMID: 33580107 PMCID: PMC7881007 DOI: 10.1038/s41598-021-83134-y] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/27/2020] [Accepted: 01/14/2021] [Indexed: 11/22/2022] Open
Abstract
The annotation of pulse wave contours (PWC) is expensive and time-consuming, which hinders the formation of large-scale datasets that match the requirements of deep learning. To obtain better results under few-shot PWC conditions, a small-parameter unit structure and a multi-scale feature-extraction model are proposed. In the small-parameter unit structure, information of adjacent cells is transmitted through state variables. Simultaneously, a forgetting gate is used to update the information and retain the long-term dependence of the PWC in the form of a unit series. The multi-scale feature-extraction model is an integrated model containing three parts. Convolutional neural networks are used to extract spatial features of single-period PWCs and rhythm features of multi-period PWCs. Recurrent neural networks are used to retain the long-term dependence features of the PWC. Finally, an inference layer performs classification using the extracted features. Classification experiments on cardiovascular diseases are performed on a photoplethysmography dataset and a continuous non-invasive blood pressure dataset. Results show that the classification accuracy of the multi-scale feature-extraction model reaches 80% and 96% on the two datasets, respectively.
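A hedged sketch of the three-branch layout the abstract outlines: one CNN over a single-period contour, one CNN over a multi-period sequence, and an RNN for long-term dependence, concatenated into an inference layer. All sizes, the LSTM choice, and the two-class output are illustrative assumptions.

```python
# Hedged sketch of a multi-scale pulse-wave-contour classifier with three branches.
import torch
import torch.nn as nn

class MultiScalePWCClassifier(nn.Module):
    def __init__(self, n_classes=2, hidden=32):
        super().__init__()
        self.single_cnn = nn.Sequential(nn.Conv1d(1, hidden, 5, padding=2),
                                        nn.ReLU(), nn.AdaptiveAvgPool1d(1))
        self.multi_cnn = nn.Sequential(nn.Conv1d(1, hidden, 9, padding=4),
                                       nn.ReLU(), nn.AdaptiveAvgPool1d(1))
        self.rnn = nn.LSTM(1, hidden, batch_first=True)
        self.inference = nn.Linear(3 * hidden, n_classes)

    def forward(self, single_period, multi_period):
        a = self.single_cnn(single_period).squeeze(-1)       # spatial features
        b = self.multi_cnn(multi_period).squeeze(-1)         # rhythm features
        _, (h, _) = self.rnn(multi_period.transpose(1, 2))   # long-term dependence
        return self.inference(torch.cat([a, b, h[-1]], dim=1))

if __name__ == "__main__":
    single = torch.randn(4, 1, 125)     # one pulse period
    multi = torch.randn(4, 1, 1000)     # several consecutive periods
    print(MultiScalePWCClassifier()(single, multi).shape)   # torch.Size([4, 2])
```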
Collapse
|
94
|
Lin TH, Akamatsu T, Tsao Y. Sensing ecosystem dynamics via audio source separation: A case study of marine soundscapes off northeastern Taiwan. PLoS Comput Biol 2021; 17:e1008698. [PMID: 33600436 PMCID: PMC7891715 DOI: 10.1371/journal.pcbi.1008698] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/05/2020] [Accepted: 01/12/2021] [Indexed: 11/28/2022] Open
Abstract
Remote acquisition of information on ecosystem dynamics is essential for conservation management, especially for the deep ocean. Soundscapes offer unique opportunities to study the behavior of soniferous marine animals and their interactions with various noise-generating activities at a fine temporal resolution. However, the retrieval of soundscape information remains challenging owing to the limitations of audio analysis techniques in the face of highly variable interfering sources. This study investigated the application of a seafloor acoustic observatory as a long-term platform for observing marine ecosystem dynamics through audio source separation. A source separation model based on the assumption of source-specific periodicity was used to factorize time-frequency representations of long-duration underwater recordings. With minimal supervision, the model learned to discriminate source-specific spectral features and proved effective in separating sounds made by cetaceans, soniferous fish, and abiotic sources in the deep-water soundscapes off northeastern Taiwan. Results revealed phenological differences among the sound sources and identified diurnal and seasonal interactions between cetaceans and soniferous fish. Applying clustering to the source separation results generated a database featuring the diversity of soundscapes and revealed a compositional shift in clusters of cetacean vocalizations and fish choruses over diurnal and seasonal cycles. The source separation model enables the transformation of single-channel audio into multiple channels encoding the dynamics of biophony, geophony, and anthropophony, which are essential for characterizing the community of soniferous animals, the quality of the acoustic habitat, and their interactions. Our results demonstrate that source separation can facilitate acoustic diversity assessment, a crucial task in soundscape-based ecosystem monitoring. Future implementation of soundscape information retrieval in long-term marine observation networks will make soundscapes a new tool for conservation management in an increasingly noisy ocean.
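A hedged sketch of the general idea behind periodicity-based source separation: factorize a long-term spectrogram with NMF and group the components by the dominant period of their activation time series, so that sources with different temporal rhythms end up in different groups. The component count, the median split, and the random stand-in data are assumptions; this is an illustration of the idea, not the paper's exact model.

```python
# Hedged sketch: NMF factorization of a long-term spectrogram, with components
# grouped by the dominant period of their activations.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
n_freq, n_frames = 64, 512
spectrogram = rng.random((n_freq, n_frames))          # stand-in long-term spectrogram

model = NMF(n_components=8, init="nndsvda", max_iter=400, random_state=0)
W = model.fit_transform(spectrogram)                   # (n_freq, K) spectral bases
H = model.components_                                  # (K, n_frames) activations

def dominant_period(activation):
    """Return the period (in frames) of the strongest non-DC Fourier component."""
    spectrum = np.abs(np.fft.rfft(activation - activation.mean()))
    k = np.argmax(spectrum[1:]) + 1
    return len(activation) / k

periods = np.array([dominant_period(h) for h in H])
# Group components into, e.g., "fast" (short-period) and "slow" (long-period) sources.
fast = periods < np.median(periods)
source_fast = W[:, fast] @ H[fast]
source_slow = W[:, ~fast] @ H[~fast]
print(periods.round(1), source_fast.shape, source_slow.shape)
```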
Collapse
Affiliation(s)
- Tzu-Hao Lin
- Biodiversity Research Center, Academia Sinica, Taipei, Taiwan (R.O.C)
| | - Tomonari Akamatsu
- The Ocean Policy Research Institute, The Sasakawa Peace Foundation, Tokyo, Japan
| | - Yu Tsao
- Research Center for Information Technology Innovation, Academia Sinica, Taipei, Taiwan (R.O.C)
| |
Collapse
|
95
|
Sun C, Zhang M, Wu R, Lu J, Xian G, Yu Q, Gong X, Luo R. A convolutional recurrent neural network with attention framework for speech separation in monaural recordings. Sci Rep 2021; 11:1434. [PMID: 33446851 PMCID: PMC7809293 DOI: 10.1038/s41598-020-80713-3] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/08/2020] [Accepted: 12/21/2020] [Indexed: 11/29/2022] Open
Abstract
Most monaural speech separation studies use only a single type of network, and the separation effect is typically unsatisfactory, making high-quality speech separation difficult. In this study, we propose a convolutional recurrent neural network with attention (CRNN-A) framework for speech separation, fusing the advantages of the two networks. The proposed separation framework uses a convolutional neural network (CNN) as the front-end of a recurrent neural network (RNN), alleviating the problem that a sole RNN cannot effectively learn the necessary features. The framework makes use of the translation invariance provided by the CNN to extract information without modifying the original signals. Within the supplemented CNN, two different convolution kernels are designed to capture information in the time and frequency domains of the input spectrogram. After concatenating the time-domain and frequency-domain feature maps, the feature information of speech is exploited through consecutive convolutional layers. Finally, the feature map learned by the front-end CNN is combined with the original spectrogram and sent to the back-end RNN. An attention mechanism is further incorporated, focusing on the relationships among different feature maps. The effectiveness of the proposed method is evaluated on the standard dataset MIR-1K, and the results show that the proposed method outperforms the baseline RNN and other popular speech separation methods in terms of GNSDR (global normalised source-to-distortion ratio), GSIR (global source-to-interference ratio), and GSAR (global source-to-artifacts ratio). In summary, the proposed CRNN-A framework can effectively combine the advantages of CNNs and RNNs, and further optimise separation performance via the attention mechanism. The proposed framework can shed new light on speech separation, speech enhancement, and other related fields.
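A minimal sketch of the layout the abstract describes: two convolution kernels oriented along time and along frequency, concatenation of their feature maps with the input spectrogram, a simple attention weighting over the feature maps, and a recurrent back-end that produces a separation mask. Kernel shapes, channel counts, and the sigmoid-gate attention are illustrative assumptions, not the authors' exact CRNN-A.

```python
# Minimal sketch of a CRNN-with-attention separator with time- and frequency-oriented kernels.
import torch
import torch.nn as nn

class CRNNAttention(nn.Module):
    def __init__(self, n_freq=128, ch=8, hidden=128):
        super().__init__()
        self.time_conv = nn.Conv2d(1, ch, kernel_size=(1, 5), padding=(0, 2))
        self.freq_conv = nn.Conv2d(1, ch, kernel_size=(5, 1), padding=(2, 0))
        self.attn = nn.Sequential(nn.Conv2d(2 * ch + 1, 2 * ch + 1, 1), nn.Sigmoid())
        self.rnn = nn.GRU((2 * ch + 1) * n_freq, hidden, batch_first=True)
        self.mask = nn.Sequential(nn.Linear(hidden, n_freq), nn.Sigmoid())

    def forward(self, spec):                       # spec: (B, F, T) magnitude
        x = spec.unsqueeze(1)                      # (B, 1, F, T)
        feats = torch.cat([self.time_conv(x), self.freq_conv(x), x], dim=1)
        feats = feats * self.attn(feats)           # attention over feature maps
        B, C, F, T = feats.shape
        seq = feats.permute(0, 3, 1, 2).reshape(B, T, C * F)
        h, _ = self.rnn(seq)
        return self.mask(h).transpose(1, 2)        # (B, F, T) separation mask

if __name__ == "__main__":
    mask = CRNNAttention()(torch.rand(2, 128, 64))
    print(mask.shape)    # torch.Size([2, 128, 64])
```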
Collapse
Affiliation(s)
- Chao Sun
- College of Electrical Engineering, Sichuan University, Chengdu, 610065, China
| | - Min Zhang
- Institute of Urban and Rural Planning and Design Zhejiang, Hangzhou, 310007, China
| | - Ruijuan Wu
- Chengdu Dagongbochuang Information Technology Co., Ltd., Chengdu, 610059, China
| | - Junhong Lu
- College of Electrical Engineering, Sichuan University, Chengdu, 610065, China
| | - Guo Xian
- Chengdu Dagongbochuang Information Technology Co., Ltd., Chengdu, 610059, China
| | - Qin Yu
- College of Electrical Engineering, Sichuan University, Chengdu, 610065, China
| | - Xiaofeng Gong
- College of Electrical Engineering, Sichuan University, Chengdu, 610065, China
| | - Ruisen Luo
- College of Electrical Engineering, Sichuan University, Chengdu, 610065, China.
| |
Collapse
|
96
|
Li H, Xu Y, Ke D, Su K. μ-law SGAN for generating spectra with more details in speech enhancement. Neural Netw 2021; 136:17-27. [PMID: 33422929 DOI: 10.1016/j.neunet.2020.12.017] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/19/2020] [Revised: 12/16/2020] [Accepted: 12/17/2020] [Indexed: 11/30/2022]
Abstract
The goal of monaural speech enhancement is to separate clean speech from noisy speech. Recently, many studies have employed generative adversarial networks (GAN) to deal with monaural speech enhancement tasks. When using generative adversarial networks for this task, the output of the generator is a speech waveform or a spectrum, such as a magnitude spectrum, a mel-spectrum or a complex-valued spectrum. The spectra generated by current speech enhancement methods in the time-frequency domain usually lack details, such as consonants and harmonics with low energy. In this paper, we propose a new type of adversarial training framework for spectrum generation, named μ-law spectrum generative adversarial networks (μ-law SGAN). We introduce a trainable μ-law spectrum compression layer (USCL) into the proposed discriminator to compress the dynamic range of the spectrum. As a result, the compressed spectrum can display more detailed information. In addition, we use the spectrum transformed by USCL to regularize the generator's training, so that the generator can pay more attention to the details of the spectrum. Experimental results on the open dataset Voice Bank + DEMAND show that μ-law SGAN is an effective generative adversarial architecture for speech enhancement. Moreover, visual spectrogram analysis suggests that μ-law SGAN pays more attention to the enhancement of low energy harmonics and consonants.
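A hedged sketch of a trainable μ-law compression layer in the spirit of the USCL described above: y = log(1 + μx) / log(1 + μ) applied to a magnitude spectrum, with μ learned. The log-parameterization of μ and the initial value are assumptions; the paper's exact layer may differ.

```python
# Hedged sketch of a trainable mu-law spectrum compression layer.
import torch
import torch.nn as nn

class MuLawCompression(nn.Module):
    def __init__(self, mu_init=255.0):
        super().__init__()
        # Parameterize log(mu) so that mu stays positive during training.
        self.log_mu = nn.Parameter(torch.log(torch.tensor(mu_init)))

    def forward(self, magnitude_spec):
        mu = torch.exp(self.log_mu)
        return torch.log1p(mu * magnitude_spec) / torch.log1p(mu)

if __name__ == "__main__":
    layer = MuLawCompression()
    spec = torch.rand(1, 257, 100)           # magnitude spectrogram in [0, 1]
    compressed = layer(spec)                 # low-energy details are expanded
    compressed.sum().backward()              # mu receives a gradient
    print(float(layer.log_mu.grad.abs()))
```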
Collapse
Affiliation(s)
- Hongfeng Li
- School of Information Science and Technology, Beijing Forestry University, 35 Qing-Hua East Road, Beijing 100083, China; Engineering Research Center for Forestry-oriented Intelligent Information Processing of National Forestry and Grassland Administration, Beijing 100083, China.
| | - Yanyan Xu
- School of Information Science and Technology, Beijing Forestry University, 35 Qing-Hua East Road, Beijing 100083, China; Engineering Research Center for Forestry-oriented Intelligent Information Processing of National Forestry and Grassland Administration, Beijing 100083, China.
| | - Dengfeng Ke
- School of Information Science, Beijing Language and Culture University, Beijing 100083, China.
| | - Kaile Su
- Institute for Integrated and Intelligent Systems, Griffith University, Nathan, QLD 4111, Australia.
| |
Collapse
|
97
|
Saleem N, Khattak MI, Al-Hasan M, Jan A. Learning time-frequency mask for noisy speech enhancement using gaussian-bernoulli pre-trained deep neural networks. JOURNAL OF INTELLIGENT & FUZZY SYSTEMS 2021. [DOI: 10.3233/jifs-201014] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/15/2022]
Abstract
Speech enhancement is a very important problem in various speech processing applications. Recently, supervised speech enhancement using deep learning approaches to estimate a time-frequency mask has shown remarkable performance gains. In this paper, we propose a time-frequency masking-based supervised speech enhancement method for improving the intelligibility and quality of noisy speech. We believe that a large performance gain can be achieved if deep neural networks (DNNs) are layer-wise pre-trained by stacking Gaussian-Bernoulli Restricted Boltzmann Machines (GB-RBMs). The proposed DNN, called a Gaussian-Bernoulli Deep Belief Network (GB-DBN), is optimized by minimizing the errors between the estimated and pre-defined masks. A non-linear Mel-scale weighted mean square error (LMW-MSE) loss function is used as the training criterion. We examine the performance of the proposed pre-training scheme using different DNNs built on three time-frequency masks: the ideal amplitude mask (IAM), the ideal ratio mask (IRM), and the phase-sensitive mask (PSM). The results in different noisy conditions demonstrate that DNNs pre-trained by the proposed scheme provide a consistent performance gain in terms of perceived speech intelligibility and quality. The proposed pre-training scheme is also effective and robust with noisy training data.
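The three training targets named above, written out in their commonly used forms (the paper may use slight variants): the ideal amplitude mask (IAM), the ideal ratio mask (IRM), and the phase-sensitive mask (PSM). The toy STFT helper and the random stand-in signals below are assumptions for illustration only.

```python
# The three time-frequency masks in their commonly used forms, computed from
# clean/noise/noisy STFTs.
import numpy as np

def stft_frames(x, n_fft=512, hop=128):
    frames = np.lib.stride_tricks.sliding_window_view(x, n_fft)[::hop]
    return np.fft.rfft(frames * np.hanning(n_fft), axis=-1)

rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)
noise = 0.5 * rng.standard_normal(16000)
S, N = stft_frames(clean), stft_frames(noise)
Y = S + N
eps = 1e-8

iam = np.abs(S) / (np.abs(Y) + eps)
irm = np.sqrt(np.abs(S) ** 2 / (np.abs(S) ** 2 + np.abs(N) ** 2 + eps))
psm = np.abs(S) / (np.abs(Y) + eps) * np.cos(np.angle(S) - np.angle(Y))

# Applying an estimated mask to the noisy magnitude gives the enhanced magnitude.
enhanced_mag = np.clip(psm, 0, 1) * np.abs(Y)
print(iam.shape, irm.mean().round(3), enhanced_mag.shape)
```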
Collapse
Affiliation(s)
- Nasir Saleem
- Department of Electrical Engineering, University of Engineering & Technology, Peshawar, Pakistan
- Department of Electrical Engineering, FET, Gomal University, Dera Ismail Khan, Pakistan
| | - Muhammad Irfan Khattak
- Department of Electrical Engineering, University of Engineering & Technology, Peshawar, Pakistan
- Department of Electrical Engineering, FET, Gomal University, Dera Ismail Khan, Pakistan
| | - Mu’ath Al-Hasan
- College of Engineering, Al Ain University, United Arab Emirates (UAE)
| | - Atif Jan
- Department of Electrical Engineering, University of Engineering & Technology, Peshawar, Pakistan
| |
Collapse
|
98
|
Tan K, Zhang X, Wang D. Deep Learning Based Real-time Speech Enhancement for Dual-microphone Mobile Phones. IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING 2021; 29:1853-1863. [PMID: 34179221 PMCID: PMC8224499 DOI: 10.1109/taslp.2021.3082318] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/17/2023]
Abstract
In mobile speech communication, speech signals can be severely corrupted by background noise when the far-end talker is in a noisy acoustic environment. To suppress background noise, speech enhancement systems are typically integrated into mobile phones, in which one or more microphones are deployed. In this study, we propose a novel deep learning based approach to real-time speech enhancement for dual-microphone mobile phones. The proposed approach employs a new densely-connected convolutional recurrent network to perform dual-channel complex spectral mapping. We utilize a structured pruning technique to compress the model without significantly degrading the enhancement performance, which yields a low-latency and memory-efficient enhancement system for real-time processing. Experimental results suggest that the proposed approach consistently outperforms an earlier approach to dual-channel speech enhancement for mobile phone communication, as well as a deep learning based beamformer.
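A hedged sketch of structured pruning, one of the ingredients the abstract mentions for shrinking the model for real-time use: whole output channels of a convolution are removed by their L2 norm using PyTorch's pruning utilities. The 50% pruning amount and the toy layer are arbitrary assumptions, not the paper's settings.

```python
# Hedged sketch: structured (channel-level) pruning of a toy convolution.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

conv = nn.Conv2d(in_channels=4, out_channels=64, kernel_size=3, padding=1)

# Prune 50% of the output channels (dim=0) with the smallest L2 norm.
prune.ln_structured(conv, name="weight", amount=0.5, n=2, dim=0)
prune.remove(conv, "weight")     # bake the mask into the weights

channel_norms = conv.weight.detach().flatten(1).norm(dim=1)
kept = int((channel_norms > 0).sum())
print(f"channels kept: {kept}/{conv.out_channels}")   # 32/64

# In practice the zeroed channels would then be physically removed (and the next
# layer's input channels adjusted) to realize the latency and memory savings.
```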
Collapse
Affiliation(s)
- Ke Tan
- Department of Computer Science and Engineering, The Ohio State University, Columbus, OH, 43210-1277 USA
| | - Xueliang Zhang
- Department of Computer Science, Inner Mongolia University, Hohhot 010021, China
| | - DeLiang Wang
- Department of Computer Science and Engineering and the Center for Cognitive and Brain Sciences, The Ohio State University, Columbus, OH 43210-1277, USA
| |
Collapse
|
99
|
An Experimental Analysis of Deep Learning Architectures for Supervised Speech Enhancement. ELECTRONICS 2020. [DOI: 10.3390/electronics10010017] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/16/2022]
Abstract
Recent speech enhancement research has shown that deep learning techniques are very effective in removing background noise. Many deep neural networks have been proposed, showing promising results for improving overall speech perception. The deep multilayer perceptron, convolutional neural networks, and the denoising autoencoder are well-established architectures for speech enhancement; however, choosing between different deep learning models has been mainly empirical. Consequently, a comparative analysis is needed between these three architecture types to show the factors affecting their performance. In this paper, this analysis is presented by comparing seven deep learning models that belong to these three categories. The comparison evaluates the overall quality of the output speech using five objective evaluation metrics and a subjective evaluation with 23 listeners; the ability to deal with challenging noise conditions; generalization ability; complexity; and processing time. Further analysis is then provided using two different approaches. The first approach investigates how performance is affected by changing network hyperparameters and the structure of the data, including the Lombard effect. The second approach interprets the results by visualizing the spectrogram of the output layer of all the investigated models, and the spectrograms of the hidden layers of the convolutional neural network architecture. Finally, a general evaluation of supervised deep-learning-based speech enhancement is performed using SWOC analysis, discussing the technique’s Strengths, Weaknesses, Opportunities, and Challenges. The results of this paper contribute to the understanding of how different deep neural networks perform the speech enhancement task, highlight the strengths and weaknesses of each architecture, and provide recommendations for achieving better performance. This work facilitates the development of better deep neural networks for speech enhancement in the future.
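A hedged sketch of one of the kinds of objective measures such comparisons rely on (alongside PESQ and STOI): scale-invariant SDR between an enhanced signal and the clean reference, implemented directly in NumPy. The random stand-in signals are assumptions; real evaluations use speech recordings.

```python
# Hedged sketch: scale-invariant SDR as a simple objective enhancement metric.
import numpy as np

def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB."""
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    distortion = estimate - target
    return 10 * np.log10(np.sum(target**2) / (np.sum(distortion**2) + eps))

rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)
noisy = clean + 0.5 * rng.standard_normal(16000)
denoised = clean + 0.1 * rng.standard_normal(16000)   # stand-in for a model output
print(f"noisy SI-SDR: {si_sdr(noisy, clean):.1f} dB, "
      f"enhanced SI-SDR: {si_sdr(denoised, clean):.1f} dB")
```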
Collapse
|
100
|
Wang NYH, Wang HLS, Wang TW, Fu SW, Lu X, Wang HM, Tsao Y. Improving the Intelligibility of Speech for Simulated Electric and Acoustic Stimulation Using Fully Convolutional Neural Networks. IEEE Trans Neural Syst Rehabil Eng 2020; 29:184-195. [PMID: 33275585 DOI: 10.1109/tnsre.2020.3042655] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/09/2022]
Abstract
Combined electric and acoustic stimulation (EAS) has demonstrated better speech recognition than conventional cochlear implants (CIs) and yields satisfactory performance under quiet conditions. However, when noise is present, both the electric and the acoustic signals may be distorted, resulting in poor recognition performance. To suppress noise effects, speech enhancement (SE) is a necessary unit in EAS devices. Recently, a time-domain speech enhancement algorithm based on fully convolutional neural networks (FCN) with a short-time objective intelligibility (STOI)-based objective function (termed FCN(S) for short) has received increasing attention due to its simple structure and its effectiveness in restoring clean speech signals from noisy counterparts. With evidence showing the benefits of FCN(S) for normal speech, this study sets out to assess its ability to improve the intelligibility of EAS-simulated speech. Objective evaluations and listening tests were conducted to examine the performance of FCN(S) in improving the speech intelligibility of normal and vocoded speech in noisy environments. The experimental results show that, compared with the traditional minimum mean-square error SE method and the deep denoising autoencoder SE method, FCN(S) obtains larger intelligibility gains for normal as well as vocoded speech. This study, being the first to evaluate deep learning SE approaches for EAS, confirms that FCN(S) is an effective SE approach that may potentially be integrated into an EAS processor to benefit users in noisy environments.
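A minimal sketch of a fully convolutional waveform-to-waveform enhancer of the kind the abstract describes, not the authors' FCN(S): stacked 1-D convolutions with no fully connected layers map a noisy waveform to an enhanced waveform. The STOI-based objective is replaced here by a plain MSE stand-in, since a differentiable STOI approximation is beyond the scope of this sketch; kernel sizes and depths are assumptions.

```python
# Minimal sketch (not FCN(S) itself): fully convolutional time-domain enhancement
# with an MSE stand-in for the STOI-based objective.
import torch
import torch.nn as nn

class FCNEnhancer(nn.Module):
    def __init__(self, ch=32, layers=4, kernel=55):
        super().__init__()
        blocks, in_ch = [], 1
        for _ in range(layers):
            blocks += [nn.Conv1d(in_ch, ch, kernel, padding=kernel // 2), nn.PReLU()]
            in_ch = ch
        blocks += [nn.Conv1d(ch, 1, kernel, padding=kernel // 2), nn.Tanh()]
        self.net = nn.Sequential(*blocks)       # no fully connected layers

    def forward(self, wave):                    # wave: (B, 1, samples)
        return self.net(wave)

if __name__ == "__main__":
    model = FCNEnhancer()
    noisy = torch.randn(2, 1, 16000)
    clean = torch.randn(2, 1, 16000)
    enhanced = model(noisy)
    loss = torch.nn.functional.mse_loss(enhanced, clean)   # stand-in objective
    loss.backward()
    print(enhanced.shape, float(loss))
```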
Collapse
|