101
Ceolini E, Hjortkjær J, Wong DDE, O'Sullivan J, Raghavan VS, Herrero J, Mehta AD, Liu SC, Mesgarani N. Brain-informed speech separation (BISS) for enhancement of target speaker in multitalker speech perception. Neuroimage 2020; 223:117282. [PMID: 32828921] [PMCID: PMC8056438] [DOI: 10.1016/j.neuroimage.2020.117282]
Abstract
Hearing-impaired people often struggle to follow the speech stream of an individual talker in noisy environments. Recent studies show that the brain tracks attended speech and that the attended talker can be decoded from neural data on a single-trial level. This raises the possibility of "neuro-steered" hearing devices in which the brain-decoded intention of a hearing-impaired listener is used to enhance the voice of the attended speaker from a speech separation front-end. So far, methods that use this paradigm have focused on optimizing the brain decoding and the acoustic speech separation independently. In this work, we propose a novel framework called brain-informed speech separation (BISS) in which the information about the attended speech, as decoded from the subject's brain, is directly used to perform speech separation in the front-end. We present a deep learning model that uses neural data to extract the clean audio signal that a listener is attending to from a multi-talker speech mixture. We show that the framework can be applied successfully to the decoded output from either invasive intracranial electroencephalography (iEEG) or non-invasive electroencephalography (EEG) recordings from hearing-impaired subjects. It also results in improved speech separation, even in scenes with background noise. The generalization capability of the system renders it a perfect candidate for neuro-steered hearing-assistive devices.
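Illustrative sketch (not the paper's model): the core BISS idea of conditioning a separation front-end on a brain-decoded representation of the attended talker can be pictured with a small PyTorch-style mask estimator that takes the decoded envelope as an extra input feature; the layer sizes, envelope dimensionality, and class name below are assumptions.

# Minimal sketch (not the paper's architecture): a T-F mask estimator conditioned
# on a brain-decoded envelope of the attended talker. Shapes are illustrative.
import torch
import torch.nn as nn

class EnvelopeConditionedSeparator(nn.Module):
    def __init__(self, n_freq=257, hidden=256):
        super().__init__()
        # per-frame input: mixture magnitude spectrum + 1-dim decoded envelope
        self.rnn = nn.LSTM(n_freq + 1, hidden, num_layers=2, batch_first=True)
        self.mask = nn.Linear(hidden, n_freq)

    def forward(self, mix_mag, envelope):
        # mix_mag: (batch, frames, n_freq); envelope: (batch, frames)
        x = torch.cat([mix_mag, envelope.unsqueeze(-1)], dim=-1)
        h, _ = self.rnn(x)
        m = torch.sigmoid(self.mask(h))      # mask for the attended talker
        return m * mix_mag                   # enhanced magnitude spectrogram

model = EnvelopeConditionedSeparator()
mix = torch.rand(1, 100, 257)                # dummy mixture spectrogram
env = torch.rand(1, 100)                     # dummy decoded attention envelope
est = model(mix, env)                        # (1, 100, 257)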
Affiliation(s)
- Enea Ceolini: University of Zürich and ETH Zürich, Institute of Neuroinformatics, Switzerland.
- Jens Hjortkjær: Department of Health Technology, Danmarks Tekniske Universitet DTU, Kongens Lyngby, Denmark; Danish Research Centre for Magnetic Resonance, Copenhagen University Hospital Hvidovre, Hvidovre, Denmark.
- Daniel D E Wong: Laboratoire des Systèmes Perceptifs, CNRS, UMR 8248, Paris, France; Département d'Études Cognitives, École Normale Supérieure, PSL Research University, Paris, France.
- James O'Sullivan: Department of Electrical Engineering, Columbia University, New York, NY, USA; Mortimer B. Zuckerman Mind Brain Behavior Institute, Columbia University, New York, NY, USA.
- Vinay S Raghavan: Department of Electrical Engineering, Columbia University, New York, NY, USA; Mortimer B. Zuckerman Mind Brain Behavior Institute, Columbia University, New York, NY, USA.
- Jose Herrero: Department of Neurosurgery, Hofstra-Northwell School of Medicine and Feinstein Institute for Medical Research, Manhasset, New York, NY, USA.
- Ashesh D Mehta: Department of Neurosurgery, Hofstra-Northwell School of Medicine and Feinstein Institute for Medical Research, Manhasset, New York, NY, USA.
- Shih-Chii Liu: University of Zürich and ETH Zürich, Institute of Neuroinformatics, Switzerland.
- Nima Mesgarani: Department of Electrical Engineering, Columbia University, New York, NY, USA; Mortimer B. Zuckerman Mind Brain Behavior Institute, Columbia University, New York, NY, USA.
102
Schlittenlacher J, Turner RE, Moore BCJ. Development of a Deep Neural Network for Speeding Up a Model of Loudness for Time-Varying Sounds. Trends Hear 2020; 24:2331216520943074. [PMID: 32853098] [PMCID: PMC7457659] [DOI: 10.1177/2331216520943074]
Abstract
The “time-varying loudness” (TVL) model of Glasberg and Moore calculates “instantaneous loudness” every 1 ms, and this is used to generate predictions of short-term loudness, the loudness of a short segment of sound, such as a word in a sentence, and of long-term loudness, the loudness of a longer segment of sound, such as a whole sentence. The calculation of instantaneous loudness is computationally intensive and real-time implementation of the TVL model is difficult. To speed up the computation, a deep neural network (DNN) was trained to predict instantaneous loudness using a large database of speech sounds and artificial sounds (tones alone and tones in white or pink noise), with the predictions of the TVL model as a reference (providing the “correct” answer, specifically the loudness level in phons). A multilayer perceptron with three hidden layers was found to be sufficient, with more complex DNN architecture not yielding higher accuracy. After training, the deviations between the predictions of the TVL model and the predictions of the DNN were typically less than 0.5 phons, even for types of sounds that were not used for training (music, rain, animal sounds, and washing machine). The DNN calculates instantaneous loudness over 100 times more quickly than the TVL model. Possible applications of the DNN are discussed.
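Illustrative sketch (assumed details): a three-hidden-layer multilayer perceptron of the kind described above, regressing a loudness level in phons from one frame of input features, with TVL-model outputs as training labels. The feature dimension and layer widths are illustrative, not the paper's.

# Sketch of a three-hidden-layer MLP regressing instantaneous loudness (phons).
# Feature dimension and layer widths are illustrative assumptions.
import torch
import torch.nn as nn

n_features = 64                              # e.g., per-frame auditory features
mlp = nn.Sequential(
    nn.Linear(n_features, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 1),                       # predicted loudness level in phons
)

features = torch.rand(32, n_features)        # a batch of 1-ms frames
target_phons = torch.rand(32, 1) * 100       # TVL-model outputs used as labels
loss = nn.functional.mse_loss(mlp(features), target_phons)
loss.backward()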
Affiliation(s)
- Brian C J Moore: Department of Experimental Psychology, University of Cambridge.
103
Wearable Hearing Device Spectral Enhancement Driven by Non-Negative Sparse Coding-Based Residual Noise Reduction. Sensors (Basel) 2020; 20:s20205751. [PMID: 33050447] [PMCID: PMC7600179] [DOI: 10.3390/s20205751]
Abstract
This paper proposes a novel technique to improve a spectral statistical filter for speech enhancement, to be applied in wearable hearing devices such as hearing aids. The proposed method is implemented considering a 32-channel uniform polyphase discrete Fourier transform filter bank, for which the overall algorithm processing delay is 8 ms in accordance with the hearing device requirements. The proposed speech enhancement technique, which exploits the concepts of both non-negative sparse coding (NNSC) and spectral statistical filtering, provides an online unified framework to overcome the problem of residual noise in spectral statistical filters under noisy environments. First, the spectral gain attenuator of the statistical Wiener filter is obtained using the a priori signal-to-noise ratio (SNR) estimated through a decision-directed approach. Next, the spectrum estimated using the Wiener spectral gain attenuator is decomposed by applying the NNSC technique to the target speech and residual noise components. These components are used to develop an NNSC-based Wiener spectral gain attenuator to achieve enhanced speech. The performance of the proposed NNSC-Wiener filter was evaluated through a perceptual evaluation of the speech quality scores under various noise conditions with SNRs ranging from -5 to 20 dB. The results indicated that the proposed NNSC-Wiener filter can outperform the conventional Wiener filter and NNSC-based speech enhancement methods at all SNRs.
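Illustrative sketch: the decision-directed a priori SNR estimate and Wiener spectral gain referred to above follow a standard formulation; a numpy version is shown below, without the NNSC decomposition stage, and with an assumed smoothing constant alpha.

# Sketch of the standard decision-directed a priori SNR estimate and Wiener gain
# (the NNSC residual-noise stage described above is not reproduced here).
import numpy as np

def wiener_gain(noisy_power, noise_power, prev_clean_power, alpha=0.98):
    gamma = noisy_power / np.maximum(noise_power, 1e-12)          # a posteriori SNR
    xi = alpha * prev_clean_power / np.maximum(noise_power, 1e-12) \
         + (1.0 - alpha) * np.maximum(gamma - 1.0, 0.0)           # decision-directed a priori SNR
    return xi / (1.0 + xi)                                        # Wiener spectral gain per bin

noisy_power = np.abs(np.random.randn(257)) ** 2
noise_power = np.full(257, 0.1)
prev_clean_power = np.zeros(257)
gain = wiener_gain(noisy_power, noise_power, prev_clean_power)
enhanced_mag = gain * np.sqrt(noisy_power)                        # applied to the noisy magnitude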
104
Fernandez-Blanco E, Rivero D, Pazos A. EEG signal processing with separable convolutional neural network for automatic scoring of sleeping stage. Neurocomputing 2020. [DOI: 10.1016/j.neucom.2020.05.085]
105
Zhou Y, Chen Y, Ma Y, Liu H. A Real-Time Dual-Microphone Speech Enhancement Algorithm Assisted by Bone Conduction Sensor. Sensors (Basel) 2020; 20:E5050. [PMID: 32899533] [PMCID: PMC7571026] [DOI: 10.3390/s20185050]
Abstract
The quality and intelligibility of the speech are usually impaired by the interference of background noise when using internet voice calls. To solve this problem in the context of wearable smart devices, this paper introduces a dual-microphone, bone-conduction (BC) sensor assisted beamformer and a simple recurrent unit (SRU)-based neural network postfilter for real-time speech enhancement. Assisted by the BC sensor, which is insensitive to the environmental noise compared to the regular air-conduction (AC) microphone, the accurate voice activity detection (VAD) can be obtained from the BC signal and incorporated into the adaptive noise canceller (ANC) and adaptive block matrix (ABM). The SRU-based postfilter consists of a recurrent neural network with a small number of parameters, which improves the computational efficiency. The sub-band signal processing is designed to compress the input features of the neural network, and the scale-invariant signal-to-distortion ratio (SI-SDR) is developed as the loss function to minimize the distortion of the desired speech signal. Experimental results demonstrate that the proposed real-time speech enhancement system provides significant speech sound quality and intelligibility improvements for all noise types and levels when compared with the AC-only beamformer with a postfiltering algorithm.
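Illustrative sketch: the scale-invariant signal-to-distortion ratio (SI-SDR) used above as the training objective has a standard closed form; a numpy version, negated for use as a loss.

# Sketch of the scale-invariant signal-to-distortion ratio (SI-SDR) used above
# as a training objective (negated so that lower is better).
import numpy as np

def si_sdr(estimate, target, eps=1e-8):
    target = target - target.mean()
    estimate = estimate - estimate.mean()
    scale = np.dot(estimate, target) / (np.dot(target, target) + eps)
    projection = scale * target              # target component of the estimate
    noise = estimate - projection            # everything else counts as distortion
    return 10 * np.log10((projection @ projection) / (noise @ noise + eps))

target = np.random.randn(16000)
estimate = target + 0.1 * np.random.randn(16000)
loss = -si_sdr(estimate, target)             # minimize negative SI-SDR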
Affiliation(s)
- Yi Zhou: School of Communication and Information Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065, China.
- Yufan Chen: School of Communication and Information Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065, China.
- Yongbao Ma: Suresense Technology, Chongqing 400065, China.
- Hongqing Liu: School of Communication and Information Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065, China.
106
Delfarah M, Liu Y, Wang D. A two-stage deep learning algorithm for talker-independent speaker separation in reverberant conditions. The Journal of the Acoustical Society of America 2020; 148:1157. [PMID: 33003849] [PMCID: PMC7473777] [DOI: 10.1121/10.0001779]
Abstract
Speaker separation is a special case of speech separation, in which the mixture signal comprises two or more speakers. Many talker-independent speaker separation methods have been introduced in recent years to address this problem in anechoic conditions. To consider more realistic environments, this paper investigates talker-independent speaker separation in reverberant conditions. To effectively deal with speaker separation and speech dereverberation, extending the deep computational auditory scene analysis (CASA) approach to a two-stage system is proposed. In this method, reverberant utterances are first separated and separated utterances are then dereverberated. The proposed two-stage deep CASA system significantly outperforms a baseline one-stage deep CASA method in real reverberant conditions. The proposed system has superior separation performance at the frame level and higher accuracy in assigning separated frames to individual speakers. The proposed system successfully generalizes to an unseen speech corpus and exhibits similar performance to a talker-dependent system.
Affiliation(s)
- Masood Delfarah: Department of Computer Science and Engineering, The Ohio State University, Columbus, Ohio 43210, USA.
- Yuzhou Liu: Department of Computer Science and Engineering, The Ohio State University, Columbus, Ohio 43210, USA.
- DeLiang Wang: Department of Computer Science and Engineering, The Ohio State University, Columbus, Ohio 43210, USA.
107
Pandey A, Wang D. On Cross-Corpus Generalization of Deep Learning Based Speech Enhancement. IEEE/ACM Transactions on Audio, Speech, and Language Processing 2020; 28:2489-2499. [PMID: 33748327] [PMCID: PMC7971413] [DOI: 10.1109/taslp.2020.3016487]
Abstract
In recent years, supervised approaches using deep neural networks (DNNs) have become the mainstream for speech enhancement. It has been established that DNNs generalize well to untrained noises and speakers if trained using a large number of noises and speakers. However, we find that DNNs fail to generalize to new speech corpora in low signal-to-noise ratio (SNR) conditions. In this work, we establish that the lack of generalization is mainly due to the channel mismatch, i.e. different recording conditions between the trained and untrained corpus. Additionally, we observe that traditional channel normalization techniques are not effective in improving cross-corpus generalization. Further, we evaluate publicly available datasets that are promising for generalization. We find one particular corpus to be significantly better than others. Finally, we find that using a smaller frame shift in short-time processing of speech can significantly improve cross-corpus generalization. The proposed techniques to address cross-corpus generalization include channel normalization, better training corpus, and smaller frame shift in short-time Fourier transform (STFT). These techniques together improve the objective intelligibility and quality scores on untrained corpora significantly.
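Illustrative sketch: the effect of a smaller frame shift can be pictured with scipy's STFT; the 32-ms window and the 16-ms versus 4-ms hops below are assumptions, not necessarily the paper's settings.

# Sketch: reducing the STFT frame shift (hop) while keeping the frame length,
# as in the cross-corpus finding above. Window/hop values are illustrative.
import numpy as np
from scipy.signal import stft

fs = 16000
x = np.random.randn(fs)                      # 1 s of dummy audio
frame = int(0.032 * fs)                      # 32-ms frames

# conventional hop: 16 ms (50% overlap)
_, _, Z_coarse = stft(x, fs, nperseg=frame, noverlap=frame - int(0.016 * fs))
# smaller hop: 4 ms (87.5% overlap), roughly 4x more frames for the same signal
_, _, Z_fine = stft(x, fs, nperseg=frame, noverlap=frame - int(0.004 * fs))

print(Z_coarse.shape, Z_fine.shape)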
Affiliation(s)
- Ashutosh Pandey: Department of Computer Science and Engineering, The Ohio State University, Columbus, OH 43210, USA.
- DeLiang Wang: Department of Computer Science and Engineering and the Center for Cognitive and Brain Sciences, The Ohio State University, Columbus, OH 43210, USA.
108
Monaural Singing Voice and Accompaniment Separation Based on Gated Nested U-Net Architecture. Symmetry (Basel) 2020. [DOI: 10.3390/sym12061051]
Abstract
This paper proposes a separation model adopting a gated nested U-Net (GNU-Net) architecture, which is essentially a deeply supervised symmetric encoder–decoder network that can generate full-resolution feature maps. Through a series of nested skip pathways, it can reduce the semantic gap between the feature maps of the encoder and decoder subnetworks. In the GNU-Net architecture, only the backbone, not the nested part, uses gated linear units (GLUs) instead of conventional convolutional layers. The outputs of GNU-Net are fed into a time-frequency (T-F) mask layer to generate two masks, one for the singing voice and one for the accompaniment. Those two estimated masks, along with the magnitude and phase spectra of the mixture, can then be transformed into time-domain signals. We explored two types of T-F mask layer, a discriminative training network and a difference mask layer; the experimental results show the latter to be better. We evaluated the proposed model by comparing it with three other models, and also with ideal T-F masks. The results demonstrate that the proposed model outperforms the compared models, and its performance approaches that of the ideal ratio mask (IRM). More importantly, the proposed model outputs the separated singing voice and accompaniment simultaneously, whereas the three compared models can only separate one source per trained model.
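Illustrative sketch: the gated linear unit (GLU) convolution used in the GNU-Net backbone gates one convolution's output with the sigmoid of a parallel convolution; a PyTorch-style block with assumed channel counts, not the full network.

# Sketch of a gated linear unit (GLU) convolution block of the kind used in the
# GNU-Net backbone: one branch is gated by the sigmoid of a parallel branch.
import torch
import torch.nn as nn

class GLUConv2d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel=3):
        super().__init__()
        self.feat = nn.Conv2d(in_ch, out_ch, kernel, padding=kernel // 2)
        self.gate = nn.Conv2d(in_ch, out_ch, kernel, padding=kernel // 2)

    def forward(self, x):
        return self.feat(x) * torch.sigmoid(self.gate(x))

block = GLUConv2d(1, 16)
spec = torch.rand(1, 1, 128, 64)             # (batch, channel, freq, time) spectrogram patch
out = block(spec)                            # (1, 16, 128, 64)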
109
Wang ZQ, Wang P, Wang D. Complex Spectral Mapping for Single- and Multi-Channel Speech Enhancement and Robust ASR. IEEE/ACM Transactions on Audio, Speech, and Language Processing 2020; 28:1778-1787. [PMID: 33748326] [PMCID: PMC7971156] [DOI: 10.1109/taslp.2020.2998279]
Abstract
This study proposes a complex spectral mapping approach for single- and multi-channel speech enhancement, where deep neural networks (DNNs) are used to predict the real and imaginary (RI) components of the direct-path signal from noisy and reverberant ones. The proposed system contains two DNNs. The first one performs single-channel complex spectral mapping. The estimated complex spectra are used to compute a minimum variance distortion-less response (MVDR) beamformer. The RI components of beamforming results, which encode spatial information, are then combined with the RI components of the mixture to train the second DNN for multi-channel complex spectral mapping. With estimated complex spectra, we also propose a novel method of time-varying beamforming. State-of-the-art performance is obtained on the speech enhancement and recognition tasks of the CHiME-4 corpus. More specifically, our system obtains 6.82%, 3.19% and 2.00% word error rates (WER) respectively on the single-, two-, and six-microphone tasks of CHiME-4, significantly surpassing the current best results of 9.15%, 3.91% and 2.24% WER.
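Illustrative sketch: an MVDR beamformer computed from estimated speech and noise covariance matrices for one frequency bin, with the steering vector taken as the principal eigenvector of the speech covariance; this is one common formulation and may differ in detail from the paper's.

# Sketch of a per-frequency MVDR beamformer computed from estimated speech and
# noise covariance matrices (one common formulation; details may differ from the paper).
import numpy as np

def mvdr_weights(speech_cov, noise_cov):
    # steering vector: principal eigenvector of the estimated speech covariance
    eigvals, eigvecs = np.linalg.eigh(speech_cov)
    d = eigvecs[:, -1]
    noise_inv = np.linalg.inv(noise_cov + 1e-6 * np.eye(noise_cov.shape[0]))
    return noise_inv @ d / (d.conj() @ noise_inv @ d)    # (mics,) complex weights

mics = 6
X = np.random.randn(mics, 200) + 1j * np.random.randn(mics, 200)   # one bin, 200 frames
speech_cov = X @ X.conj().T / 200             # stand-ins for DNN-derived statistics
noise_cov = np.eye(mics) + 0.01 * speech_cov
w = mvdr_weights(speech_cov, noise_cov)
beamformed = w.conj() @ X                     # (200,) beamformed bin over frames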
Affiliation(s)
- Zhong-Qiu Wang: Department of Computer Science and Engineering, The Ohio State University, Columbus, OH 43210-1277, USA.
- Peidong Wang: Department of Computer Science and Engineering, The Ohio State University, Columbus, OH 43210-1277, USA.
- DeLiang Wang: Department of Computer Science and Engineering and the Center for Cognitive and Brain Sciences, The Ohio State University, Columbus, OH 43210-1277, USA.
110
Zhao Y, Wang D, Xu B, Zhang T. Monaural Speech Dereverberation Using Temporal Convolutional Networks with Self Attention. IEEE/ACM Transactions on Audio, Speech, and Language Processing 2020; 28:1598-1607. [PMID: 33748325] [PMCID: PMC7971181] [DOI: 10.1109/taslp.2020.2995273]
Abstract
In daily listening environments, human speech is often degraded by room reverberation, especially under highly reverberant conditions. Such degradation poses a challenge for many speech processing systems, where the performance becomes much worse than in anechoic environments. To combat the effect of reverberation, we propose a monaural (single-channel) speech dereverberation algorithm using temporal convolutional networks with self attention. Specifically, the proposed system includes a self-attention module to produce dynamic representations given input features, a temporal convolutional network to learn a nonlinear mapping from such representations to the magnitude spectrum of anechoic speech, and a one-dimensional (1-D) convolution module to smooth the enhanced magnitude among adjacent frames. Systematic evaluations demonstrate that the proposed algorithm improves objective metrics of speech quality in a wide range of reverberant conditions. In addition, it generalizes well to untrained reverberation times, room sizes, measured room impulse responses, real-world recorded noisy-reverberant speech, and different speakers.
Affiliation(s)
- Yan Zhao: Department of Computer Science and Engineering and the Center for Cognitive and Brain Sciences, The Ohio State University, Columbus, OH 43210, USA.
- DeLiang Wang: Department of Computer Science and Engineering and the Center for Cognitive and Brain Sciences, The Ohio State University, Columbus, OH 43210, USA; also held a visiting appointment at the Center of Intelligent Acoustics and Immersive Communications, Northwestern Polytechnical University, Xi'an, China.
- Buye Xu: Starkey Hearing Technologies, Eden Prairie, MN 55344, USA; now with Facebook Reality Labs, Facebook, Inc., Redmond, WA 98052, USA.
- Tao Zhang: Starkey Hearing Technologies, Eden Prairie, MN 55344, USA.
111
Lee CH, Fedorov I, Rao BD, Garudadri H. SSGD: Sparsity-Promoting Stochastic Gradient Descent Algorithm for Unbiased DNN Pruning. Proceedings of the ... IEEE International Conference on Acoustics, Speech, and Signal Processing. ICASSP (Conference) 2020; 2020:5410-5414. [PMID: 33162834] [PMCID: PMC7643773] [DOI: 10.1109/icassp40776.2020.9054436]
Abstract
While deep neural networks (DNNs) have achieved state-of-the-art results in many fields, they are typically over-parameterized. Parameter redundancy, in turn, leads to inefficiency. Sparse signal recovery (SSR) techniques, on the other hand, find compact solutions to overcomplete linear problems. Therefore, a logical step is to draw the connection between SSR and DNNs. In this paper, we explore the application of iterative reweighting methods popular in SSR to learning efficient DNNs. By efficient, we mean sparse networks that require less computation and storage than the original, dense network. We propose a reweighting framework to learn sparse connections within a given architecture without biasing the optimization process, by utilizing the affine scaling transformation strategy. The resulting algorithm, referred to as Sparsity-promoting Stochastic Gradient Descent (SSGD), has simple gradient-based updates which can be easily implemented in existing deep learning libraries. We demonstrate the sparsification ability of SSGD on image classification tasks and show that it outperforms existing methods on the MNIST and CIFAR-10 datasets.
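Illustrative sketch: the reweighting idea can be pictured as an iteratively reweighted l1 penalty added to plain SGD. This is not the paper's affine-scaling SSGD update, only a generic example of sparsity-promoting reweighting, with assumed hyperparameters.

# Generic sketch of sparsity-promoting SGD via an iteratively reweighted l1
# penalty. NOT the paper's affine-scaling SSGD update; only an illustration of
# the reweighting idea applied to DNN weights.
import torch
import torch.nn as nn

model = nn.Linear(784, 10)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
lam, eps = 1e-4, 1e-3                         # assumed penalty strength and smoothing

x, y = torch.randn(32, 784), torch.randint(0, 10, (32,))
loss = nn.functional.cross_entropy(model(x), y)
opt.zero_grad()
loss.backward()
with torch.no_grad():
    for p in model.parameters():
        # reweighted l1: small-magnitude weights receive a larger shrinkage push
        p.grad += lam * p.sign() / (p.abs() + eps)
opt.step()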
Affiliation(s)
- Ching-Hua Lee: Department of ECE, University of California, San Diego.
- Bhaskar D Rao: Department of ECE, University of California, San Diego.
112
Wei Y, Zhou J, Wang Y, Liu Y, Liu Q, Luo J, Wang C, Ren F, Huang L. A Review of Algorithm & Hardware Design for AI-Based Biomedical Applications. IEEE Transactions on Biomedical Circuits and Systems 2020; 14:145-163. [PMID: 32078560] [DOI: 10.1109/tbcas.2020.2974154]
Abstract
This paper reviews the state of the art and trends of AI-based biomedical processing algorithms and hardware. The algorithms and hardware for different biomedical applications such as ECG, EEG, and hearing aids are reviewed and discussed. For algorithm design, various widely used biomedical signal classification algorithms are discussed, including support vector machines (SVM), back-propagation neural networks (BPNN), convolutional neural networks (CNN), probabilistic neural networks (PNN), recurrent neural networks (RNN), long short-term memory (LSTM) networks, and fuzzy neural networks. The pros and cons of these classification algorithms are analyzed and compared in the context of application scenarios, and the research trends of AI-based biomedical processing algorithms and applications are discussed. For hardware design, various AI-based biomedical processors are reviewed and discussed, including ECG, EEG, and EMG classification processors and hearing aid processors. Techniques at the architecture and circuit levels are analyzed and compared, and the research trends of AI-based biomedical processors are also discussed.
113
Wang ZQ, Wang D. Deep Learning Based Target Cancellation for Speech Dereverberation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 2020; 28:941-950. [PMID: 33748324] [PMCID: PMC7977279] [DOI: 10.1109/taslp.2020.2975902]
Abstract
This study investigates deep learning based single- and multi-channel speech dereverberation. For single-channel processing, we extend magnitude-domain masking and mapping based dereverberation to complex-domain mapping, where deep neural networks (DNNs) are trained to predict the real and imaginary (RI) components of the direct-path signal from reverberant (and noisy) ones. For multi-channel processing, we first compute a minimum variance distortionless response (MVDR) beamformer to cancel the direct-path signal, and then feed the RI components of the cancelled signal, which is expected to be a filtered version of non-target signals, as additional features to perform dereverberation. Trained on a large dataset of simulated room impulse responses, our models show excellent speech dereverberation and recognition performance on the test set of the REVERB challenge, consistently better than single- and multi-channel weighted prediction error (WPE) algorithms.
Affiliation(s)
- Zhong-Qiu Wang: Department of Computer Science and Engineering, The Ohio State University, Columbus, OH 43210-1277, USA.
- DeLiang Wang: Department of Computer Science and Engineering and the Center for Cognitive and Brain Sciences, The Ohio State University, Columbus, OH 43210-1277, USA.
114
Krishnagopal S, Girvan M, Ott E, Hunt BR. Separation of chaotic signals by reservoir computing. Chaos (Woodbury, N.Y.) 2020; 30:023123. [PMID: 32113243] [DOI: 10.1063/1.5132766]
Abstract
We demonstrate the utility of machine learning in the separation of superimposed chaotic signals using a technique called reservoir computing. We assume no knowledge of the dynamical equations that produce the signals and require only training data consisting of finite-time samples of the component signals. We test our method on signals that are formed as linear combinations of signals from two Lorenz systems with different parameters. Comparing our nonlinear method with the optimal linear solution to the separation problem, the Wiener filter, we find that our method significantly outperforms the Wiener filter in all the scenarios we study. Furthermore, this difference is particularly striking when the component signals have similar frequency spectra. Indeed, our method works well when the component frequency spectra are indistinguishable, a case where a Wiener filter performs essentially no separation.
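Illustrative sketch: a compact echo-state-network version of the approach, driving a random reservoir with the sum of two Lorenz signals and fitting a ridge-regression readout to recover one component. Reservoir size, spectral radius, integration step, and Lorenz settings are assumptions.

# Compact echo-state-network sketch of the idea above: drive a random reservoir
# with a sum of two chaotic signals and train a linear readout to recover one.
import numpy as np

rng = np.random.default_rng(0)

def lorenz_x(n, dt=0.01, sigma=10.0, rho=28.0, beta=8.0 / 3.0, init=(1.0, 1.0, 1.0)):
    x, y, z = init
    out = np.empty(n)
    for i in range(n):                        # forward-Euler integration (demo accuracy)
        dx = sigma * (y - x)
        dy = x * (rho - z) - y
        dz = x * y - beta * z
        x, y, z = x + dt * dx, y + dt * dy, z + dt * dz
        out[i] = x
    return out

n = 5000
s1 = lorenz_x(n, rho=28.0, init=(1.0, 1.0, 1.0))
s2 = lorenz_x(n, rho=35.0, init=(-3.0, 2.0, 25.0))    # different parameters, per the abstract
mix = s1 + s2

n_res = 300                                   # random reservoir driven by the mixture
W_in = rng.uniform(-0.5, 0.5, size=n_res)
W = rng.normal(0, 1, size=(n_res, n_res))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))       # keep spectral radius below 1
r = np.zeros(n_res)
states = np.empty((n, n_res))
for t in range(n):
    r = np.tanh(W @ r + W_in * mix[t])
    states[t] = r

train = slice(500, 4000)                      # ridge-regression readout for component 1
A = states[train]
W_out = np.linalg.solve(A.T @ A + 1e-6 * np.eye(n_res), A.T @ s1[train])
s1_hat = states @ W_out                       # estimated component signal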
Affiliation(s)
- Edward Ott: University of Maryland, College Park, Maryland 20742, USA.
- Brian R Hunt: University of Maryland, College Park, Maryland 20742, USA.
115
Lu Z, Kim JZ, Bassett DS. Supervised chaotic source separation by a tank of water. Chaos (Woodbury, N.Y.) 2020; 30:021101. [PMID: 32113226] [PMCID: PMC7007304] [DOI: 10.1063/1.5142462]
Abstract
Whether listening to overlapping conversations in a crowded room or recording the simultaneous electrical activity of millions of neurons, the natural world abounds with sparse measurements of complex overlapping signals that arise from dynamical processes. While tools that separate mixed signals into linear sources have proven necessary and useful, the underlying equational forms of most natural signals are unknown and nonlinear. Hence, there is a need for a framework that is general enough to extract sources without knowledge of their generating equations and flexible enough to accommodate nonlinear, even chaotic, sources. Here, we provide such a framework, where the sources are chaotic trajectories from independently evolving dynamical systems. We consider the mixture signal as the sum of two chaotic trajectories and propose a supervised learning scheme that extracts the chaotic trajectories from their mixture. Specifically, we recruit a complex dynamical system as an intermediate processor that is constantly driven by the mixture. We then obtain the separated chaotic trajectories based on this intermediate system by training the proper output functions. To demonstrate the generalizability of this framework in silico, we employ a tank of water as the intermediate system and show its success in separating two-part mixtures of various chaotic trajectories. Finally, we relate the underlying mechanism of this method to the state-observer problem. This relation provides a quantitative theory that explains the performance of our method, and why separation is difficult when two source signals are trajectories from the same chaotic system.
Affiliation(s)
- Zhixin Lu: Department of Bioengineering, School of Engineering and Applied Sciences, University of Pennsylvania, Philadelphia, Pennsylvania 19104, USA.
- Jason Z Kim: Department of Bioengineering, School of Engineering and Applied Sciences, University of Pennsylvania, Philadelphia, Pennsylvania 19104, USA.
- Danielle S Bassett: Department of Bioengineering, School of Engineering and Applied Sciences, University of Pennsylvania, Philadelphia, Pennsylvania 19104, USA.
116
Hueber T, Tatulli E, Girin L, Schwartz JL. Evaluating the Potential Gain of Auditory and Audiovisual Speech-Predictive Coding Using Deep Learning. Neural Comput 2020; 32:596-625. [PMID: 31951798] [DOI: 10.1162/neco_a_01264]
Abstract
Sensory processing is increasingly conceived in a predictive framework in which neurons would constantly process the error signal resulting from the comparison of expected and observed stimuli. Surprisingly, few data exist on the accuracy of predictions that can be computed in real sensory scenes. Here, we focus on the sensory processing of auditory and audiovisual speech. We propose a set of computational models based on artificial neural networks (mixing deep feedforward and convolutional networks), which are trained to predict future audio observations from present and past audio or audiovisual observations (i.e., including lip movements). Those predictions exploit purely local phonetic regularities with no explicit call to higher linguistic levels. Experiments are conducted on the multispeaker LibriSpeech audio speech database (around 100 hours) and on the NTCD-TIMIT audiovisual speech database (around 7 hours). They appear to be efficient in a short temporal range (25-50 ms), predicting 50% to 75% of the variance of the incoming stimulus, which could result in potentially saving up to three-quarters of the processing power. Then they quickly decrease and almost vanish after 250 ms. Adding information on the lips slightly improves predictions, with a 5% to 10% increase in explained variance. Interestingly the visual gain vanishes more slowly, and the gain is maximum for a delay of 75 ms between image and predicted sound.
Affiliation(s)
- Thomas Hueber: Université Grenoble Alpes, CNRS, Grenoble INP, GIPSA-lab, 38000 Grenoble, France.
- Eric Tatulli: Université Grenoble Alpes, CNRS, Grenoble INP, GIPSA-lab, 38000 Grenoble, France.
- Laurent Girin: Université Grenoble Alpes, CNRS, Grenoble INP, GIPSA-lab, 38000 Grenoble, France, and Inria Grenoble-Rhône-Alpes, 38330 Montbonnot-Saint Martin, France.
- Jean-Luc Schwartz: Université Grenoble Alpes, CNRS, Grenoble INP, GIPSA-lab, 38000 Grenoble, France.
117
Wang L, Cavallaro A. Deep Learning Assisted Time-Frequency Processing for Speech Enhancement on Drones. IEEE Transactions on Emerging Topics in Computational Intelligence 2020. [DOI: 10.1109/tetci.2020.3014934]
118
Delfarah M, Wang D. Deep Learning for Talker-dependent Reverberant Speaker Separation: An Empirical Study. IEEE/ACM Transactions on Audio, Speech, and Language Processing 2019; 27:1839-1848. [PMID: 33748321] [PMCID: PMC7970708] [DOI: 10.1109/taslp.2019.2934319]
Abstract
Speaker separation refers to the problem of separating speech signals from a mixture of simultaneous speakers. Previous studies are limited to addressing the speaker separation problem in anechoic conditions. This paper addresses the problem of talker-dependent speaker separation in reverberant conditions, which are characteristic of real-world environments. We employ recurrent neural networks with bidirectional long short-term memory (BLSTM) to separate and dereverberate the target speech signal. We propose two-stage networks to effectively deal with both speaker separation and speech dereverberation. In the two-stage model, the first stage separates and dereverberates two-talker mixtures and the second stage further enhances the separated target signal. We have extensively evaluated the two-stage architecture, and our empirical results demonstrate large improvements over unprocessed mixtures and clear performance gain over single-stage networks in a wide range of target-to-interferer ratios and reverberation times in simulated as well as recorded rooms. Moreover, we show that time-frequency masking yields better performance than spectral mapping for reverberant speaker separation.
Affiliation(s)
- Masood Delfarah: Computer Science and Engineering, The Ohio State University, Columbus, OH, USA.
- DeLiang Wang: Computer Science and Engineering, The Ohio State University, Columbus, OH, USA.
119
Vocoder-free text-to-speech synthesis incorporating generative adversarial networks using low-/multi-frequency STFT amplitude spectra. Comput Speech Lang 2019. [DOI: 10.1016/j.csl.2019.05.008]
120
Bouwmans T, Javed S, Sultana M, Jung SK. Deep neural network concepts for background subtraction: A systematic review and comparative evaluation. Neural Netw 2019; 117:8-66. [DOI: 10.1016/j.neunet.2019.04.024]
121
Luo Y, Mesgarani N. Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking for Speech Separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 2019; 27:1256-1266. [PMID: 31485462] [PMCID: PMC6726126] [DOI: 10.1109/taslp.2019.2915167]
Abstract
Single-channel, speaker-independent speech separation methods have recently seen great progress. However, the accuracy, latency, and computational cost of such methods remain insufficient. The majority of the previous methods have formulated the separation problem through the time-frequency representation of the mixed signal, which has several drawbacks, including the decoupling of the phase and magnitude of the signal, the suboptimality of time-frequency representation for speech separation, and the long latency of the entire system. To address these shortcomings, we propose a fully-convolutional time-domain audio separation network (Conv-TasNet), a deep learning framework for end-to-end time-domain speech separation. Conv-TasNet uses a linear encoder to generate a representation of the speech waveform optimized for separating individual speakers. Speaker separation is achieved by applying a set of weighting functions (masks) to the encoder output. The modified encoder representations are then inverted back to the waveforms using a linear decoder. The masks are found using a temporal convolutional network (TCN) consisting of stacked 1-D dilated convolutional blocks, which allows the network to model the long-term dependencies of the speech signal while maintaining a small model size. The proposed Conv-TasNet system significantly outperforms previous time-frequency masking methods in separating two- and three-speaker mixtures. Additionally, Conv-TasNet surpasses several ideal time-frequency magnitude masks in two-speaker speech separation as evaluated by both objective distortion measures and subjective quality assessment by human listeners. Finally, Conv-TasNet has a significantly smaller model size and a much shorter minimum latency, making it a suitable solution for both offline and real-time speech separation applications. This study therefore represents a major step toward the realization of speech separation systems for real-world speech processing technologies.
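Illustrative sketch: a much-reduced encoder-mask-decoder network in the spirit of Conv-TasNet, with a plain convolutional stack standing in for the dilated temporal convolutional network separator; all dimensions and the class name are assumptions, not the published configuration.

# Much-reduced sketch of the Conv-TasNet structure: learned encoder, masking in
# the learned basis, learned decoder. A plain conv stack stands in for the TCN.
import torch
import torch.nn as nn

class TinyTasNet(nn.Module):
    def __init__(self, n_src=2, n_basis=128, win=16):
        super().__init__()
        self.n_src = n_src
        self.encoder = nn.Conv1d(1, n_basis, win, stride=win // 2, bias=False)
        self.separator = nn.Sequential(                  # stand-in for the dilated TCN
            nn.Conv1d(n_basis, n_basis, 3, padding=1), nn.PReLU(),
            nn.Conv1d(n_basis, n_basis * n_src, 1),
        )
        self.decoder = nn.ConvTranspose1d(n_basis, 1, win, stride=win // 2, bias=False)

    def forward(self, mix):                              # mix: (batch, 1, samples)
        feats = torch.relu(self.encoder(mix))            # (batch, basis, frames)
        masks = torch.sigmoid(self.separator(feats))
        masks = masks.view(mix.size(0), self.n_src, -1, feats.size(-1))
        srcs = [self.decoder(masks[:, i] * feats) for i in range(self.n_src)]
        return torch.stack(srcs, dim=1)                  # (batch, n_src, 1, samples)

model = TinyTasNet()
est = model(torch.randn(2, 1, 16000))                    # two dummy mixtures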
122
123
Xu S, Wang J, Wang R, Chen J, Zou W. High-accuracy optical convolution unit architecture for convolutional neural networks by cascaded acousto-optical modulator arrays. Optics Express 2019; 27:19778-19787. [PMID: 31503733] [DOI: 10.1364/oe.27.019778]
Abstract
Optical neural networks (ONNs) have become competitive candidates for the next generation of high-performance neural network accelerators because of their low power consumption and high speed. Beyond the fully connected neural networks demonstrated in pioneering works, optical computing hardware can also implement convolutional neural networks (CNNs) through hardware reuse. Following this concept, we propose an optical convolution unit (OCU) architecture. By reusing the OCU with different inputs and weights, convolutions with arbitrary input sizes can be performed. A proof-of-concept experiment is carried out with cascaded acousto-optical modulator arrays. When the neural network parameters are trained ex situ, the OCU performs convolutions with an SDR of up to 28.22 dBc and performs well on inference for typical CNN tasks. Furthermore, with in situ training we obtain a higher SDR of 36.27 dBc, verifying that the OCU can be further refined by in situ training. Besides its effectiveness and high accuracy, the simplified OCU architecture, serving as a building block, could be easily duplicated and integrated into future chip-scale optical CNNs.
124
Li F, Akagi M. Blind monaural singing voice separation using rank-1 constraint robust principal component analysis and vocal activity detection. Neurocomputing 2019. [DOI: 10.1016/j.neucom.2019.04.030]
125
Pandey A, Wang D. A New Framework for CNN-Based Speech Enhancement in the Time Domain. IEEE/ACM Transactions on Audio, Speech, and Language Processing 2019; 27:1179-1188. [PMID: 34262993] [PMCID: PMC8276831] [DOI: 10.1109/taslp.2019.2913512]
Abstract
This paper proposes a new learning mechanism for a fully convolutional neural network (CNN) to address speech enhancement in the time domain. The CNN takes as input the time frames of a noisy utterance and outputs the time frames of the enhanced utterance. At training time, we add an extra operation that converts the time-domain output to the frequency domain. This conversion corresponds to a simple matrix multiplication and is hence differentiable, implying that a frequency-domain loss can be used for training in the time domain. We use a mean absolute error loss between the enhanced short-time Fourier transform (STFT) magnitude and the clean STFT magnitude to train the CNN. This way, the model can exploit the domain knowledge of converting a signal to the frequency domain for analysis. Moreover, this approach avoids the well-known invalid STFT problem, since the proposed CNN operates in the time domain. Experimental results demonstrate that the proposed method substantially outperforms other speech enhancement methods. The proposed method is easy to implement and applicable to related speech processing tasks that require time-frequency masking or spectral mapping.
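Illustrative sketch: training a time-domain network with a frequency-domain loss, here the mean absolute error between STFT magnitudes of the enhanced and clean signals. torch.stft is used for brevity where the paper realizes the transform as a fixed matrix multiplication, and the toy network is an assumption.

# Sketch of training a time-domain enhancement network with a frequency-domain
# loss: MAE between STFT magnitudes of enhanced and clean signals.
import torch
import torch.nn as nn

net = nn.Sequential(                          # toy stand-in for the time-domain CNN
    nn.Conv1d(1, 16, 31, padding=15), nn.ReLU(),
    nn.Conv1d(16, 1, 31, padding=15),
)

def stft_mag_loss(est, clean, n_fft=512, hop=256):
    window = torch.hann_window(n_fft)
    E = torch.stft(est, n_fft, hop, window=window, return_complex=True)
    C = torch.stft(clean, n_fft, hop, window=window, return_complex=True)
    return (E.abs() - C.abs()).abs().mean()   # mean absolute error on magnitudes

noisy = torch.randn(4, 1, 16000)
clean = torch.randn(4, 1, 16000)
enhanced = net(noisy)
loss = stft_mag_loss(enhanced.squeeze(1), clean.squeeze(1))
loss.backward()                               # gradients flow back into the time domain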
Affiliation(s)
- Ashutosh Pandey: Department of Computer Science and Engineering, The Ohio State University, Columbus, OH 43210, USA.
- DeLiang Wang: Department of Computer Science and Engineering and the Center for Cognitive and Brain Sciences, The Ohio State University, Columbus, OH 43210, USA.
126
Feng S, Ren W, Han M, Chen YW. Robust manifold broad learning system for large-scale noisy chaotic time series prediction: A perturbation perspective. Neural Netw 2019; 117:179-190. [PMID: 31170577] [DOI: 10.1016/j.neunet.2019.05.009]
Abstract
Noise and outliers commonly exist in dynamical systems because of sensor disturbances or extreme dynamics. Thus, robustness and generalization capacity are of vital importance for system modeling. In this paper, the robust manifold broad learning system (RM-BLS) is proposed for system modeling and large-scale noisy chaotic time series prediction. Manifold embedding is utilized to discover the evolution of the chaotic system. The manifold representation is randomly corrupted by perturbations, while features not related to the low-dimensional manifold embedding are discarded by feature selection. This leads to a robust learning paradigm and achieves better generalization performance. We also develop an efficient solution for Stiefel manifold optimization, in which the orthogonality constraints are maintained by a Cayley transformation and a curvilinear search algorithm. Furthermore, we discuss the ideas shared between random perturbation approximation and other mainstream regularization methods, and we prove the equivalence between perturbations to the manifold embedding and Tikhonov regularization. Simulation results on large-scale noisy chaotic time series prediction illustrate the robustness and generalization performance of our method.
Affiliation(s)
- Shoubo Feng: Faculty of Electronic Information and Electrical Engineering, Dalian University of Technology, Dalian, Liaoning 116024, China.
- Weijie Ren: Faculty of Electronic Information and Electrical Engineering, Dalian University of Technology, Dalian, Liaoning 116024, China.
- Min Han: Faculty of Electronic Information and Electrical Engineering, Dalian University of Technology, Dalian, Liaoning 116024, China.
- Yen Wei Chen: Graduate School of Information Science and Engineering, Ritsumeikan University, Shiga, Japan.
127
Han C, O'Sullivan J, Luo Y, Herrero J, Mehta AD, Mesgarani N. Speaker-independent auditory attention decoding without access to clean speech sources. Science Advances 2019; 5:eaav6134. [PMID: 31106271] [PMCID: PMC6520028] [DOI: 10.1126/sciadv.aav6134]
Abstract
Speech perception in crowded environments is challenging for hearing-impaired listeners. Assistive hearing devices cannot lower interfering speakers without knowing which speaker the listener is focusing on. One possible solution is auditory attention decoding in which the brainwaves of listeners are compared with sound sources to determine the attended source, which can then be amplified to facilitate hearing. In realistic situations, however, only mixed audio is available. We utilize a novel speech separation algorithm to automatically separate speakers in mixed audio, with no need for the speakers to have prior training. Our results show that auditory attention decoding with automatically separated speakers is as accurate and fast as using clean speech sounds. The proposed method significantly improves the subjective and objective quality of the attended speaker. Our study addresses a major obstacle in actualization of auditory attention decoding that can assist hearing-impaired listeners and reduce listening effort for normal-hearing subjects.
Affiliation(s)
- Cong Han: Department of Electrical Engineering, Columbia University, New York, NY, USA; Zuckerman Mind Brain Behavior Institute, Columbia University, New York, NY, USA.
- James O'Sullivan: Department of Electrical Engineering, Columbia University, New York, NY, USA; Zuckerman Mind Brain Behavior Institute, Columbia University, New York, NY, USA.
- Yi Luo: Department of Electrical Engineering, Columbia University, New York, NY, USA; Zuckerman Mind Brain Behavior Institute, Columbia University, New York, NY, USA.
- Jose Herrero: Department of Neurosurgery, Hofstra-Northwell School of Medicine and Feinstein Institute for Medical Research, Manhasset, New York, NY, USA.
- Ashesh D Mehta: Department of Neurosurgery, Hofstra-Northwell School of Medicine and Feinstein Institute for Medical Research, Manhasset, New York, NY, USA.
- Nima Mesgarani: Department of Electrical Engineering, Columbia University, New York, NY, USA; Zuckerman Mind Brain Behavior Institute, Columbia University, New York, NY, USA.
128
Zhao Y, Wang ZQ, Wang D. Two-stage Deep Learning for Noisy-reverberant Speech Enhancement. IEEE/ACM Transactions on Audio, Speech, and Language Processing 2019; 27:53-62. [PMID: 31106230] [PMCID: PMC6519714] [DOI: 10.1109/taslp.2018.2870725]
Abstract
In real-world situations, speech reaching our ears is commonly corrupted by both room reverberation and background noise. These distortions are detrimental to speech intelligibility and quality, and also pose a serious problem to many speech-related applications, including automatic speech and speaker recognition. In order to deal with the combined effects of noise and reverberation, we propose a two-stage strategy to enhance corrupted speech, where denoising and dereverberation are conducted sequentially using deep neural networks. In addition, we design a new objective function that incorporates clean phase during model training to better estimate spectral magnitudes, which would in turn yield better phase estimates when combined with iterative phase reconstruction. The two-stage model is then jointly trained to optimize the proposed objective function. Systematic evaluations and comparisons show that the proposed algorithm improves objective metrics of speech intelligibility and quality substantially, and significantly outperforms previous one-stage enhancement systems.
Affiliation(s)
- Yan Zhao: Department of Computer Science and Engineering, The Ohio State University, Columbus, OH 43210, USA.
- Zhong-Qiu Wang: Department of Computer Science and Engineering, The Ohio State University, Columbus, OH 43210, USA.
- DeLiang Wang: Department of Computer Science and Engineering and the Center for Cognitive and Brain Sciences, The Ohio State University, Columbus, OH 43210, USA; also holds a visiting appointment at the Center of Intelligent Acoustics and Immersive Communications, Northwestern Polytechnical University, Xi'an, China.
129
Mahmmod BM, Ramli AR, Baker T, Al-Obeidat F, Abdulhussain SH, Jassim WA. Speech Enhancement Algorithm Based on Super-Gaussian Modeling and Orthogonal Polynomials. IEEE Access 2019; 7:103485-103504. [DOI: 10.1109/access.2019.2929864]