1

Islam MA, Xu Y, Monk T, Afshar S, van Schaik A. Noise-robust text-dependent speaker identification using cochlear models. J Acoust Soc Am 2022; 151:500. [PMID: 35105043] [DOI: 10.1121/10.0009314]
Abstract
One challenging issue in speaker identification (SID) is to achieve noise-robust performance. Humans can accurately identify speakers, even in noisy environments. We can leverage our knowledge of the function and anatomy of the human auditory pathway to design SID systems that achieve better noise-robust performance than conventional approaches. We propose a text-dependent SID system based on a real-time cochlear model called cascade of asymmetric resonators with fast-acting compression (CARFAC). We investigate the SID performance of CARFAC on signals corrupted by noise of various types and levels. We compare its performance with conventional auditory feature generators, including mel-frequency cepstrum coefficients and frequency-domain linear predictions, as well as with another biologically inspired model, the auditory nerve model. We show that CARFAC outperforms other approaches when signals are corrupted by noise. Our results are consistent across datasets, types and levels of noise, different speaking speeds, and back-end classifiers. We show that the noise-robust SID performance of CARFAC is largely due to its nonlinear processing of auditory input signals. Presumably, the human auditory system achieves noise-robust performance via inherent nonlinearities as well.
Affiliation(s)
- Md Atiqul Islam
- International Centre for Neuromorphic Systems in the MARCS Institute for Brain, Behaviour, and Development, Western Sydney University, Penrith, New South Wales, 2751, Australia
- Ying Xu
- International Centre for Neuromorphic Systems in the MARCS Institute for Brain, Behaviour, and Development, Western Sydney University, Penrith, New South Wales, 2751, Australia
- Travis Monk
- International Centre for Neuromorphic Systems in the MARCS Institute for Brain, Behaviour, and Development, Western Sydney University, Penrith, New South Wales, 2751, Australia
- Saeed Afshar
- International Centre for Neuromorphic Systems in the MARCS Institute for Brain, Behaviour, and Development, Western Sydney University, Penrith, New South Wales, 2751, Australia
- André van Schaik
- International Centre for Neuromorphic Systems in the MARCS Institute for Brain, Behaviour, and Development, Western Sydney University, Penrith, New South Wales, 2751, Australia
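The entry above compares CARFAC against mel-frequency cepstrum coefficients (MFCCs), the conventional baseline. As a reminder of what that baseline computes, here is a minimal MFCC front end in plain NumPy; the parameters (16 kHz sample rate, 26 mel filters, 13 coefficients) are common defaults, not values taken from the paper.

```python
import numpy as np

def hz_to_mel(f):
    # O'Shaughnessy's formula, the usual MFCC convention
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    # Triangular filters spaced evenly on the mel scale
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fb[i - 1, k] = (k - l) / (c - l)
        for k in range(c, r):
            fb[i - 1, k] = (r - k) / (r - c)
    return fb

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_fft=512,
         n_filters=26, n_ceps=13):
    # Frame the signal, window, take the power spectrum,
    # apply the mel filterbank, log-compress, then DCT-II to decorrelate.
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hamming(frame_len)
    fb = mel_filterbank(n_filters, n_fft, sr)
    ceps = np.empty((n_frames, n_ceps))
    n = np.arange(n_filters)
    for t in range(n_frames):
        frame = signal[t * hop:t * hop + frame_len] * window
        power = np.abs(np.fft.rfft(frame, n_fft)) ** 2
        logmel = np.log(fb @ power + 1e-10)
        for c in range(n_ceps):  # DCT-II of the log filterbank energies
            ceps[t, c] = np.sum(logmel * np.cos(np.pi * c * (2 * n + 1) / (2 * n_filters)))
    return ceps
```

CARFAC replaces this linear filterbank-plus-log pipeline with a nonlinear cascade of resonators; the abstract attributes its noise robustness largely to that nonlinearity.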

2

Espín López JM, Huertas Celdrán A, Marín-Blázquez JG, Esquembre F, Martínez Pérez G. S3: An AI-Enabled User Continuous Authentication for Smartphones Based on Sensors, Statistics and Speaker Information. Sensors 2021; 21:3765. [PMID: 34071655] [PMCID: PMC8199259] [DOI: 10.3390/s21113765]
Abstract
Continuous authentication systems have been proposed as a promising solution to authenticate smartphone users non-intrusively. However, current systems have important weaknesses related to the amount of data or time needed to build precise user profiles, together with high rates of false alerts. Voice is a powerful dimension for identifying subjects, but its suitability and importance for continuous authentication systems have not been analyzed in depth. This work presents the S3 platform, an artificial-intelligence-enabled continuous authentication system that combines data from sensors, application statistics, and voice to authenticate smartphone users. Experiments tested the relevance of each kind of data, explored different strategies to combine them, and determined how many days of training are needed to obtain sufficiently accurate profiles. Results showed that voice is much more relevant than sensors and application statistics when building a precise authentication system, and that combining individual models was the best strategy. Finally, the S3 platform reached good performance with only five days of use available for training the users' profiles. As an additional contribution, a dataset with 21 volunteers interacting freely with their smartphones for more than sixty days has been created and made available to the community.
Affiliation(s)
- Juan Manuel Espín López
- Department of Information and Communications Engineering (DIIC), University of Murcia, 30100 Murcia, Spain
- Alberto Huertas Celdrán
- Communication Systems Group (CSG), Department of Informatics (IfI), University of Zürich UZH, CH-8050 Zürich, Switzerland
- Javier G. Marín-Blázquez
- Department of Information and Communications Engineering (DIIC), University of Murcia, 30100 Murcia, Spain
- Correspondence: ; Tel.: +34-868-88-76-46
- Gregorio Martínez Pérez
- Department of Information and Communications Engineering (DIIC), University of Murcia, 30100 Murcia, Spain
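The abstract above reports that combining per-modality models beats any single modality, with voice carrying the most weight. A minimal sketch of such score-level (late) fusion follows; the modality names, weights, and threshold are hypothetical illustrations, not values from the paper.

```python
def fuse_scores(scores, weights):
    """Weighted late fusion of per-modality authentication scores.

    scores  -- dict of modality name -> match score in [0, 1]
    weights -- dict of modality name -> non-negative weight
    Returns the fused score in [0, 1].
    """
    total = sum(weights[m] for m in scores)
    return sum(scores[m] * weights[m] for m in scores) / total

def authenticate(scores, weights, threshold=0.5):
    """Accept the user if the fused score clears the threshold."""
    return fuse_scores(scores, weights) >= threshold

# Hypothetical example: voice weighted highest, matching the paper's
# finding that voice is the most informative modality.
scores = {"voice": 0.9, "sensors": 0.4, "apps": 0.5}
weights = {"voice": 3.0, "sensors": 1.0, "apps": 1.0}
```

A real deployment would calibrate the per-modality scores and learn the weights from enrollment data rather than fix them by hand.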

3

Pal M, Kumar M, Peri R, Park TJ, Kim SH, Lord C, Bishop S, Narayanan S. Meta-learning with Latent Space Clustering in Generative Adversarial Network for Speaker Diarization. IEEE/ACM Trans Audio Speech Lang Process 2021; 29:1204-1219. [PMID: 33997106] [PMCID: PMC8118028] [DOI: 10.1109/taslp.2021.3061885]
Abstract
Most speaker diarization systems based on x-vector embeddings are vulnerable to noisy environments and lack domain robustness. Earlier work on speaker diarization using a generative adversarial network (GAN) with an encoder network (ClusterGAN) to project input x-vectors into a latent space has shown promising performance on meeting data. In this paper, we extend the ClusterGAN network to improve diarization robustness and enable rapid generalization across various challenging domains. To this end, we take the pre-trained encoder from the ClusterGAN and fine-tune it using a prototypical loss (meta-ClusterGAN, or MCGAN) under the meta-learning paradigm. Experiments are conducted on CALLHOME telephonic conversations, AMI meeting data, the DIHARD-II development set, which is a challenging multi-domain corpus, and two child-clinician interaction corpora (ADOS, BOSCC) from the autism spectrum disorder domain. Extensive analyses of the experimental data investigate the effectiveness of the proposed ClusterGAN and MCGAN embeddings over x-vectors. The results show that the proposed embeddings with a normalized maximum eigengap spectral clustering (NME-SC) back-end consistently outperform the Kaldi state-of-the-art x-vector diarization system. Finally, we employ embedding fusion with x-vectors to further improve diarization performance, achieving a relative diarization error rate (DER) improvement of 6.67% to 53.93% on the aforementioned datasets using the fused embeddings over x-vectors. In addition, the MCGAN embeddings outperform x-vectors and ClusterGAN at estimating the number of speakers and at diarizing short speech segments in telephonic conversations.
Affiliation(s)
- Monisankha Pal
- Signal Analysis and Interpretation Laboratory, University of Southern California, Los Angeles, USA
- Manoj Kumar
- Signal Analysis and Interpretation Laboratory, University of Southern California, Los Angeles, USA
- Raghuveer Peri
- Signal Analysis and Interpretation Laboratory, University of Southern California, Los Angeles, USA
- Tae Jin Park
- Signal Analysis and Interpretation Laboratory, University of Southern California, Los Angeles, USA
- So Hyun Kim
- Center for Autism and the Developing Brain, Weill Cornell Medicine, USA
- Catherine Lord
- Semel Institute of Neuroscience and Human Behavior, University of California Los Angeles, USA
- Somer Bishop
- Department of Psychiatry, University of California, San Francisco, USA
- Shrikanth Narayanan
- Signal Analysis and Interpretation Laboratory, University of Southern California, Los Angeles, USA
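The prototypical loss used to fine-tune the ClusterGAN encoder is the standard prototypical-network objective: class prototypes are the means of the support embeddings, and each query is classified by its distance to the prototypes. A NumPy sketch of that loss (not the authors' implementation) looks like this:

```python
import numpy as np

def prototypical_loss(support, support_labels, queries, query_labels):
    """Prototypical-network loss: prototypes are the per-class means of the
    support embeddings; negative squared Euclidean distances to the
    prototypes act as logits, and the loss is cross-entropy against each
    query's true class."""
    classes = np.unique(support_labels)
    protos = np.stack([support[support_labels == c].mean(axis=0)
                       for c in classes])
    # negative squared distances as logits, one row per query
    d2 = ((queries[:, None, :] - protos[None, :, :]) ** 2).sum(axis=2)
    logits = -d2
    # log-softmax with the usual max-shift for numerical stability
    logits = logits - logits.max(axis=1, keepdims=True)
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.searchsorted(classes, query_labels)
    return -logp[np.arange(len(queries)), idx].mean()
```

In the meta-learning setup each episode samples a few speakers as classes, so the encoder is pushed to produce embeddings that cluster by speaker even for speakers unseen in training.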

4

Kaur R, Sharma R, Kumar P. Speaker Classification with Support Vector Machine and Crossover-Based Particle Swarm Optimization. Int J Pattern Recogn 2020. [DOI: 10.1142/s0218001420510106]
Abstract
Speech is the most natural means of communication between humans: human beings start speaking without any tool or explicit education, learning the art of speaking from the environment surrounding them. The existing literature shows that current speaker classification techniques suffer from over-fitting and parameter-tuning issues, and that efficient tuning of machine learning techniques can improve classification accuracy. To address this, this paper proposes an efficient particle swarm optimization-based support vector machine. The proposed and competing speaker classification techniques are tested on speaker classification data of Punjabi speakers. The comparative analysis reveals that the proposed technique outperforms existing techniques in terms of accuracy, F-measure, specificity, and sensitivity.
Affiliation(s)
- Rupinderdeep Kaur
- Department of Computer Science and Engineering, Thapar Institute of Engineering and Technology, Patiala, Punjab, India
- R. K. Sharma
- Department of Computer Science and Engineering, Thapar Institute of Engineering and Technology, Patiala, Punjab, India
- Parteek Kumar
- Department of Computer Science and Engineering, Thapar Institute of Engineering and Technology, Patiala, Punjab, India
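The entry above tunes SVM hyperparameters with a crossover-based particle swarm optimizer. The sketch below shows plain PSO (the crossover variant is omitted); in the paper's setting the objective would be the cross-validation error of an SVM as a function of its hyperparameters (e.g. C and the RBF kernel width), but any vector-to-scalar objective works, so the test uses a simple quadratic.

```python
import numpy as np

def pso(objective, bounds, n_particles=20, iters=100,
        w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimize `objective` over a box given by `bounds`
    (list of (lo, hi) pairs) with basic particle swarm optimization."""
    rng = np.random.default_rng(seed)
    lo, hi = np.array(bounds, dtype=float).T
    dim = len(lo)
    x = rng.uniform(lo, hi, (n_particles, dim))   # positions
    v = np.zeros_like(x)                          # velocities
    pbest = x.copy()                              # personal bests
    pbest_f = np.array([objective(p) for p in x])
    g = pbest[pbest_f.argmin()].copy()            # global best
    g_f = pbest_f.min()
    for _ in range(iters):
        r1, r2 = rng.random((2, n_particles, dim))
        # inertia + cognitive pull toward pbest + social pull toward gbest
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)
        x = np.clip(x + v, lo, hi)
        f = np.array([objective(p) for p in x])
        improved = f < pbest_f
        pbest[improved], pbest_f[improved] = x[improved], f[improved]
        if f.min() < g_f:
            g, g_f = x[f.argmin()].copy(), f.min()
    return g, g_f
```

Wrapping an SVM would mean passing `objective=lambda p: cv_error(C=10**p[0], gamma=10**p[1])` over log-scaled bounds; that wrapper is a hypothetical illustration, not code from the paper.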

6

Restricted Boltzmann Machine Vectors for Speaker Clustering and Tracking Tasks in TV Broadcast Shows. Appl Sci (Basel) 2019. [DOI: 10.3390/app9132761]
Abstract
Restricted Boltzmann Machines (RBMs) have shown success in both the front end and the back end of speaker verification systems. In this paper, we propose applying RBMs to the front end for the tasks of speaker clustering and speaker tracking in TV broadcast shows. RBMs are trained to transform utterances into a vector-based representation. Because of the lack of data for a test speaker, we propose adapting RBMs from a global model. First, the global model, referred to as the universal RBM, is trained with all the available background data. Then an adapted RBM model is trained with the data of each test speaker. The visible-to-hidden weight matrices of the adapted models are concatenated along with the bias vectors and whitened to generate the vector representation of speakers. These vectors, referred to as RBM vectors, were shown to preserve speaker-specific information and are used in the tasks of speaker clustering and speaker tracking. The evaluation was performed on audio recordings of Catalan TV broadcast shows. The experimental results show that our proposed speaker clustering system gained up to 12% relative improvement, in terms of Equal Impurity (EI), over the baseline system. In the task of speaker tracking, our system achieved relative improvements of 11% and 7% over the baseline system using cosine and Probabilistic Linear Discriminant Analysis (PLDA) scoring, respectively.
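The RBM-vector construction described above — flatten each adapted model's weight matrix and biases into one supervector per speaker, then whiten across speakers — can be sketched as follows. The shapes and the PCA-style whitening here are assumptions; the paper does not specify its exact whitening recipe.

```python
import numpy as np

def rbm_vector(weights, hid_bias, vis_bias):
    """Concatenate an adapted RBM's visible-to-hidden weight matrix with
    its bias vectors into a single speaker supervector."""
    return np.concatenate([weights.ravel(), hid_bias, vis_bias])

def whiten(vectors, eps=1e-8):
    """PCA whitening across a set of speaker supervectors: center,
    rotate onto the principal axes, and scale each axis to unit variance."""
    X = vectors - vectors.mean(axis=0)
    cov = X.T @ X / len(X)
    vals, vecs = np.linalg.eigh(cov)
    return X @ vecs / np.sqrt(vals + eps)
```

After whitening, the vectors can be scored with cosine similarity or PLDA, as in the paper's tracking experiments.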

7

Chen H, Jiao L, Liang M, Liu F, Yang S, Hou B. Fast unsupervised deep fusion network for change detection of multitemporal SAR images. Neurocomputing 2019. [DOI: 10.1016/j.neucom.2018.11.077]

8

Hoori AO, Motai Y. Multicolumn RBF Network. IEEE Trans Neural Netw Learn Syst 2018; 29:766-778. [PMID: 28113352] [DOI: 10.1109/tnnls.2017.2650865]
Abstract
This paper proposes the multicolumn RBF network (MCRN) as a method to improve the accuracy and speed of a traditional radial basis function network (RBFN). The RBFN, as a fully connected artificial neural network (ANN), suffers from costly kernel inner-product calculations due to the use of many instances as the centers of hidden units. This issue is not critical for small datasets, as adding more hidden units will not burden the computation time. However, for larger datasets, the RBFN requires many hidden units with several kernel computations to generalize the problem. The MCRN is constructed by dividing a dataset into smaller subsets using the k-d tree algorithm. The resultant subsets are treated as separate training datasets for individual RBFNs. These small RBFNs are stacked in parallel and combined into the MCRN structure during testing. The MCRN is a well-developed and easy-to-use parallel structure, because each individual ANN has been trained on its own subset and is completely separate from the other ANNs. This parallelized structure reduces the testing time compared with that of a single but larger RBFN, which cannot easily be parallelized due to its fully connected structure. Small informative subsets give the MCRN regional experience to specialize on the problem instead of generalizing it. The MCRN has been tested on many benchmark datasets and has shown better accuracy and great improvements in training and testing times compared with a single RBFN. The MCRN also shows good results compared with some machine learning techniques, such as the support vector machine and k-nearest neighbors.
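The MCRN idea — partition the data with a k-d tree, train one small RBFN per subset, and route each test point to a single subset's network — can be sketched as below. This is a simplified illustration, assuming Gaussian kernels with every leaf point as a center and a ridge-regularized solve for the output weights; the paper's exact training procedure may differ.

```python
import numpy as np

def fit_rbf(X, y, gamma=1.0, reg=1e-6):
    """One small RBFN per subset: every training point is a kernel center,
    and the output weights come from a regularized least-squares solve."""
    K = np.exp(-gamma * ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    w = np.linalg.solve(K + reg * np.eye(len(X)), y)
    return {"X": X, "w": w, "gamma": gamma}

def build_kdtree(X, y, leaf_size=20):
    """Split on the median of the widest dimension, k-d tree style,
    and train an independent RBFN on each leaf subset."""
    if len(X) <= leaf_size:
        return {"leaf": True, "model": fit_rbf(X, y)}
    dim = int(np.argmax(X.max(axis=0) - X.min(axis=0)))
    med = float(np.median(X[:, dim]))
    left = X[:, dim] <= med
    if left.all() or not left.any():  # degenerate split: stop here
        return {"leaf": True, "model": fit_rbf(X, y)}
    return {"leaf": False, "dim": dim, "med": med,
            "lo": build_kdtree(X[left], y[left], leaf_size),
            "hi": build_kdtree(X[~left], y[~left], leaf_size)}

def predict(tree, x):
    """Route a query down the tree and evaluate only that leaf's RBFN."""
    if tree["leaf"]:
        m = tree["model"]
        k = np.exp(-m["gamma"] * ((m["X"] - x) ** 2).sum(-1))
        return float(k @ m["w"])
    side = "lo" if x[tree["dim"]] <= tree["med"] else "hi"
    return predict(tree[side], x)
```

The speedup comes from evaluating one leaf-sized kernel vector per query instead of a kernel against every training instance.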

9

Imposing Class-Wise Feature Similarity in Stacked Autoencoders by Nuclear Norm Regularization. Neural Process Lett 2017. [DOI: 10.1007/s11063-017-9731-2]

10

Li J, Zhang T, Luo W, Yang J, Yuan XT, Zhang J. Sparseness Analysis in the Pretraining of Deep Neural Networks. IEEE Trans Neural Netw Learn Syst 2017; 28:1425-1438. [PMID: 27046912] [DOI: 10.1109/tnnls.2016.2541681]
Abstract
Major progress in deep multilayer neural networks (DNNs) has come from the invention of various unsupervised pretraining methods that initialize network parameters and lead to good prediction accuracy. This paper presents a sparseness analysis of the hidden units in the pretraining process. In particular, we use the L1-norm to measure sparseness and provide sufficient conditions under which pretraining leads to sparseness for popular pretraining models, such as denoising autoencoders (DAEs) and restricted Boltzmann machines (RBMs). Our experimental results demonstrate that when the sufficient conditions are satisfied, the pretraining models lead to sparseness. Our experiments also reveal that with sigmoid activation functions, pretraining plays an important sparseness role in DNNs with sigmoid (Dsigm), whereas with rectified linear unit (ReLU) activation functions, pretraining becomes less effective for DNNs with ReLU (Drelu). Fortunately, Drelu can reach higher recognition accuracy than DNNs with pretraining (DAEs and RBMs), as it captures the main benefit of pretraining (such as encouraging sparseness) in Dsigm. However, ReLU is not adapted to the different firing rates in biological neurons, because the firing rate actually changes along with the varying membrane resistances. To address this problem, we further propose a family of rectifier piecewise linear units (RePLUs) to fit the different firing rates. The experimental results show that RePLU performs better than ReLU and is comparable with some pretraining techniques, such as RBMs and DAEs.
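The L1-based sparseness measure discussed above is easy to state concretely: the mean absolute hidden activation. A small illustration, comparing sigmoid and ReLU units on the same zero-mean pre-activations (the comparison is illustrative, not a reproduction of the paper's experiments):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(z, 0.0)

def l1_sparseness(activations):
    """Mean L1-norm of the hidden activations,
    the sparseness measure used in the paper (smaller = sparser)."""
    return float(np.abs(activations).mean())
```

For zero-mean Gaussian pre-activations, sigmoid units sit near 0.5 on average while ReLU zeroes out roughly half of them, so untrained ReLU layers are already sparser — consistent with the paper's observation that pretraining matters less for Drelu.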

11

A deep belief network to predict the hot deformation behavior of a Ni-based superalloy. Neural Comput Appl 2016. [DOI: 10.1007/s00521-016-2635-7]

12

Cohen E, Malka D, Shemer A, Shahmoon A, Zalevsky Z, London M. Neural networks within multi-core optic fibers. Sci Rep 2016; 6:29080. [PMID: 27383911] [PMCID: PMC4935875] [DOI: 10.1038/srep29080]
Abstract
Hardware implementation of artificial neural networks facilitates real-time parallel processing of massive data sets. Optical neural networks offer low-volume 3D connectivity together with large bandwidth and minimal heat production, in contrast to electronic implementations. Here, we present a conceptual design for in-fiber optical neural networks. Neurons and synapses are realized as individual silica cores in a multi-core fiber. Optical signals are transferred transversely between cores by means of optical coupling. Pump-driven amplification in erbium-doped cores mimics synaptic interactions. We simulated three-layered feed-forward neural networks and explored their capabilities. Simulations suggest that networks can differentiate between given inputs depending on specific configurations of amplification; this implies classification and learning capabilities. Finally, we experimentally tested our basic neuronal elements using fibers, couplers, and amplifiers, and demonstrated that this configuration implements a neuron-like function. Therefore, devices similar to our proposed multi-core fiber could potentially serve as building blocks for future large-scale small-volume optical artificial neural networks.
Affiliation(s)
- Eyal Cohen
- Life Science Institute, Hebrew University, Jerusalem, Israel; Faculty of Engineering, Bar Ilan University, Ramat Gan, Israel
- Dror Malka
- Faculty of Engineering, Bar Ilan University, Ramat Gan, Israel; Faculty of Engineering, Holon Institute of Technology, Holon, Israel
- Amir Shemer
- Faculty of Engineering, Bar Ilan University, Ramat Gan, Israel
- Asaf Shahmoon
- Faculty of Engineering, Bar Ilan University, Ramat Gan, Israel
- Zeev Zalevsky
- Faculty of Engineering, Bar Ilan University, Ramat Gan, Israel
- Michael London
- Life Science Institute, Hebrew University, Jerusalem, Israel; The Edmond and Lily Safra Center for Brain Sciences, Hebrew University, Jerusalem, Israel

13

Learning contextualized semantics from co-occurring terms via a Siamese architecture. Neural Netw 2016; 76:65-96. [DOI: 10.1016/j.neunet.2016.01.004]

14

Ren Z, Deng Y, Dai Q. Local visual feature fusion via maximum margin multimodal deep neural network. Neurocomputing 2016. [DOI: 10.1016/j.neucom.2015.10.076]

15

Gong M, Liu J, Li H, Cai Q, Su L. A Multiobjective Sparse Feature Learning Model for Deep Neural Networks. IEEE Trans Neural Netw Learn Syst 2015; 26:3263-3277. [PMID: 26340790] [DOI: 10.1109/tnnls.2015.2469673]
Abstract
Hierarchical deep neural networks are currently popular learning models that imitate the hierarchical architecture of the human brain. Single-layer feature extractors are the bricks used to build deep networks, and sparse feature learning models are popular single-layer models that can learn useful representations. However, most of these models need a user-defined constant to control the sparsity of the representations. In this paper, we propose a multiobjective sparse feature learning model based on the autoencoder. The parameters of the model are learned by simultaneously optimizing two objectives, reconstruction error and the sparsity of the hidden units, to find a reasonable compromise between them automatically. We design a multiobjective induced learning procedure for this model based on a multiobjective evolutionary algorithm. In the experiments, we demonstrate that the learning procedure is effective and that the proposed multiobjective model can learn useful sparse features.
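The core of the multiobjective formulation above is evaluating each candidate autoencoder on two objectives and comparing candidates by Pareto dominance. A minimal sketch, assuming a sigmoid encoder and L1 sparsity (the evolutionary search loop itself is omitted):

```python
import numpy as np

def objectives(W_enc, W_dec, X):
    """The two objectives the evolutionary search trades off:
    mean squared reconstruction error, and the L1 sparsity
    of the hidden code (both to be minimized)."""
    H = 1.0 / (1.0 + np.exp(-(X @ W_enc)))   # sigmoid hidden code
    recon = float(((X - H @ W_dec) ** 2).mean())
    sparsity = float(np.abs(H).mean())
    return recon, sparsity

def dominates(a, b):
    """Pareto dominance: a is no worse than b in every objective
    and strictly better in at least one."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))
```

An evolutionary algorithm would keep the non-dominated candidates as its population, so no hand-picked sparsity constant is needed — the Pareto front exposes the whole trade-off.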

16

Schmidhuber J. Deep learning in neural networks: an overview. Neural Netw 2014; 61:85-117. [PMID: 25462637] [DOI: 10.1016/j.neunet.2014.09.003]
Abstract
In recent years, deep artificial neural networks (including recurrent ones) have won numerous contests in pattern recognition and machine learning. This historical survey compactly summarizes relevant work, much of it from the previous millennium. Shallow and Deep Learners are distinguished by the depth of their credit assignment paths, which are chains of possibly learnable, causal links between actions and effects. I review deep supervised learning (also recapitulating the history of backpropagation), unsupervised learning, reinforcement learning & evolutionary computation, and indirect search for short programs encoding deep and large networks.
Affiliation(s)
- Jürgen Schmidhuber
- Swiss AI Lab IDSIA, Istituto Dalle Molle di Studi sull'Intelligenza Artificiale, University of Lugano & SUPSI, Galleria 2, 6928 Manno-Lugano, Switzerland