1. Zheng Q, Yang X, Wang S, An X, Liu Q. Asymmetric double-winged multi-view clustering network for exploring diverse and consistent information. Neural Netw 2024;179:106563. PMID: 39111164. DOI: 10.1016/j.neunet.2024.106563.
Abstract
In unsupervised scenarios, deep contrastive multi-view clustering (DCMVC) has become an active research topic; it aims to mine the potential relationships between different views. Most existing DCMVC algorithms focus on exploring consistency information in the deep semantic features while ignoring the diverse information carried by shallow features. To fill this gap, this paper proposes a novel multi-view clustering network, termed CodingNet, that explores diverse and consistent information simultaneously. Specifically, instead of a conventional auto-encoder, we design an asymmetric network structure to extract shallow and deep features separately. Then, by driving the similarity matrix of the shallow features toward the zero matrix, we ensure diversity among the shallow features, offering a better description of the multi-view data. Moreover, we propose a dual contrastive mechanism that maintains consistency for the deep features at both the view-feature and pseudo-label levels. Our framework's efficacy is validated through extensive experiments on six widely used benchmark datasets, on which it outperforms most state-of-the-art multi-view clustering algorithms.
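The diversity constraint described in the abstract, driving the shallow-feature similarity matrix toward the zero matrix, can be sketched as a simple penalty term. This is an illustrative reconstruction from the abstract alone, not the authors' code; the use of cosine similarity and the feature shapes are assumptions.

```python
import numpy as np

def diversity_loss(shallow_features):
    """Frobenius-norm penalty pushing the cross-view cosine-similarity
    matrix of shallow features toward the zero matrix.

    shallow_features: list of (n_samples, dim) arrays, one per view.
    """
    loss = 0.0
    for a in range(len(shallow_features)):
        for b in range(a + 1, len(shallow_features)):
            fa, fb = shallow_features[a], shallow_features[b]
            # L2-normalise rows so the inner product is cosine similarity
            fa = fa / (np.linalg.norm(fa, axis=1, keepdims=True) + 1e-8)
            fb = fb / (np.linalg.norm(fb, axis=1, keepdims=True) + 1e-8)
            s = fa @ fb.T              # (n, n) cross-view similarity matrix
            loss += np.sum(s ** 2)     # squared distance to the zero matrix
    return loss
```

Minimising this term rewards shallow features that are mutually orthogonal across views (similarity matrix near zero), which is one way to read the abstract's "approximating the similarity matrix on the shallow feature to the zero matrix".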
Affiliations
- Qun Zheng: School of Earth and Space Sciences, CMA-USTC Laboratory of Fengyun Remote Sensing, University of Science and Technology of China, Hefei 230026, China
- Xihong Yang: College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China
- Siwei Wang: College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China
- Xinru An: School of Earth and Space Sciences, CMA-USTC Laboratory of Fengyun Remote Sensing, University of Science and Technology of China, Hefei 230026, China
- Qi Liu: School of Earth and Space Sciences, CMA-USTC Laboratory of Fengyun Remote Sensing, University of Science and Technology of China, Hefei 230026, China
2. Sharma M, Joshi S, Chatterjee T, Hamid R. A comprehensive empirical review of modern voice activity detection approaches for movies and TV shows. Neurocomputing 2022. DOI: 10.1016/j.neucom.2022.04.084.
3. Pan TY, Tsai WL, Chang CY, Yeh CW, Hu MC. A Hierarchical Hand Gesture Recognition Framework for Sports Referee Training-Based EMG and Accelerometer Sensors. IEEE Trans Cybern 2022;52:3172-3183. PMID: 32776885. DOI: 10.1109/tcyb.2020.3007173.
Abstract
To cultivate professional sports referees, we develop a referee training system that can recognize whether a trainee wearing the Myo armband makes the correct judging signals while watching a prerecorded professional game. The system has to correctly recognize one set of gestures related to official referee's signals (ORSs) and another set of gestures used to interact intuitively with the system. These two gesture sets involve both large-motion and subtle-motion gestures, and existing sensor-based methods using handcrafted features do not recognize all of these gestures well. In this work, deep belief networks (DBNs) are utilized to learn more representative features for hand gesture recognition, and selected handcrafted features are combined with the DBN features to achieve more robust recognition results. A hierarchical recognition scheme is designed to first classify the input gesture as a large- or subtle-motion gesture, and the corresponding classifier for each category is then used to obtain the final recognition result. In addition, the Myo armband contains eight-channel surface electromyography (sEMG) sensors and an inertial measurement unit (IMU), and these heterogeneous signals can be fused to achieve better recognition accuracy. We take basketball as an example to validate the proposed training system, and the experimental results show that the proposed hierarchical scheme considering DBN features of multimodal data outperforms other methods.
5. ORVAE: One-Class Residual Variational Autoencoder for Voice Activity Detection in Noisy Environment. Neural Process Lett 2022. DOI: 10.1007/s11063-021-10695-4.
6. Yang L, Zhao C, Lu C, Wei L, Gong J. Lateral and Longitudinal Driving Behavior Prediction Based on Improved Deep Belief Network. Sensors 2021;21(24):8498. PMID: 34960592. PMCID: PMC8706022. DOI: 10.3390/s21248498.
Abstract
Accurately predicting driving behavior can help to avoid potentially improper maneuvers by human drivers, thus guaranteeing safe driving for intelligent vehicles. In this paper, we propose a novel deep belief network (DBN), called MSR-DBN, that integrates a multi-target sigmoid regression (MSR) layer with a DBN to predict the front-wheel angle and speed of the ego vehicle. Precisely, MSR-DBN consists of two sub-networks: one for the front-wheel angle and the other for speed. This model allows one to optimize lateral and longitudinal behavior predictions through a systematic testing method. In addition, we consider the historical states of the ego vehicle and surrounding vehicles, together with the driver's operations, as inputs to predict driving behaviors in a real-world environment. Comparison of the prediction results of MSR-DBN with a general DBN model, a back-propagation (BP) neural network, support vector regression (SVR), and a radial basis function (RBF) neural network demonstrates that the proposed MSR-DBN outperforms the others in terms of accuracy and robustness.
Affiliations
- Lei Yang: School of Mechanical Engineering, Beijing Institute of Technology, Beijing 100081, China
- Chunqing Zhao: China North Vehicle Research Institute, Beijing 100072, China
- Chao Lu: School of Mechanical Engineering, Beijing Institute of Technology, Beijing 100081, China
- Lianzhen Wei: School of Mechanical Engineering, Beijing Institute of Technology, Beijing 100081, China; Yangtze Delta Region Academy of Beijing Institute of Technology, Jiaxing 314019, China
- Jianwei Gong (corresponding author): School of Mechanical Engineering, Beijing Institute of Technology, Beijing 100081, China; Yangtze Delta Region Academy of Beijing Institute of Technology, Jiaxing 314019, China
7. Atkins A, Cohen I, Benesty J. Adaptive line enhancer for nonstationary harmonic noise reduction. Comput Speech Lang 2021. DOI: 10.1016/j.csl.2021.101245.
8. Gu Y, Chi J, Liu J, Yang L, Zhang B, Yu D, Zhao Y, Lu X. A survey of computer-aided diagnosis of lung nodules from CT scans using deep learning. Comput Biol Med 2021;137:104806. PMID: 34461501. DOI: 10.1016/j.compbiomed.2021.104806.
Abstract
Lung cancer has one of the highest mortality rates of all cancers. According to the National Lung Screening Trial, patients who underwent low-dose computed tomography (CT) scanning once a year for three years showed a 20% decline in lung cancer mortality. To further improve the survival rate of lung cancer patients, computer-aided diagnosis (CAD) technology shows great potential. In this paper, we summarize existing CAD approaches that apply deep learning to CT scan data for pre-processing, lung segmentation, false-positive reduction, lung nodule detection, segmentation, classification, and retrieval. Selected papers are drawn from academic journals and conferences up to November 2020. We discuss the development of deep learning, describe several important aspects of lung nodule CAD systems, and assess the performance of the selected studies on various datasets, including LIDC-IDRI, LUNA16, LIDC, DSB2017, NLST, TianChi, and ELCAP. Overall, in the detection studies reviewed, sensitivity ranges from 61.61% to 98.10%, and the number of false positives (FPs) per scan is between 0.125 and 32. In the selected classification studies, accuracy ranges from 75.01% to 97.58%; the precision of the selected retrieval studies is between 71.43% and 87.29%. Based on this performance, deep-learning-based CAD technologies for the detection and classification of pulmonary nodules achieve satisfactory results. However, many challenges and limitations remain, including over-fitting, lack of interpretability, and insufficient annotated data. This review helps researchers and radiologists better understand CAD technology for pulmonary nodule detection, segmentation, classification, and retrieval. We summarize the performance of current techniques, consider the challenges, and propose directions for future high-impact research.
Affiliations
- Yu Gu, Jingqian Chi, Jiaqi Liu, Lidong Yang, Baohua Zhang, Dahua Yu, Ying Zhao: Inner Mongolia Key Laboratory of Pattern Recognition and Intelligent Image Processing, School of Information Engineering, Inner Mongolia University of Science and Technology, Baotou 014010, China
- Xiaoqi Lu: Inner Mongolia Key Laboratory of Pattern Recognition and Intelligent Image Processing, School of Information Engineering, Inner Mongolia University of Science and Technology, Baotou 014010, China; College of Information Engineering, Inner Mongolia University of Technology, Hohhot 010051, China
10. Sun G, Zhang C, Woodland PC. Combination of deep speaker embeddings for diarisation. Neural Netw 2021;141:372-384. PMID: 33984663. DOI: 10.1016/j.neunet.2021.04.020.
Abstract
Significant progress has recently been made in speaker diarisation following the introduction of d-vectors: speaker embeddings extracted from neural network (NN) speaker classifiers for clustering speech segments. To extract better-performing and more robust speaker embeddings, this paper proposes a c-vector method that combines multiple sets of complementary d-vectors derived from systems with different NN components. Three structures are used to implement the c-vectors, namely 2D self-attentive, gated additive, and bilinear pooling structures, relying on attention, gating, and low-rank bilinear pooling mechanisms, respectively. Furthermore, a neural-based single-pass speaker diarisation pipeline is also proposed, which uses NNs for voice activity detection, speaker change-point detection, and speaker embedding extraction. Experiments and detailed analyses are conducted on the challenging AMI and NIST RT05 datasets, which consist of real meetings with 4-10 speakers and a wide range of acoustic conditions. For systems trained on the AMI training set, relative speaker error rate (SER) reductions of 13% and 29% are obtained by using c-vectors instead of d-vectors on the AMI dev and eval sets, respectively, and a relative SER reduction of 15% is observed on RT05, showing the robustness of the proposed methods. By incorporating VoxCeleb data into the training set, the best c-vector system achieved 7%, 17%, and 16% relative SER reductions over the d-vector on the AMI dev, eval, and RT05 sets, respectively.
Affiliations
- Guangzhi Sun, Chao Zhang, Philip C Woodland: Cambridge University Engineering Department, Trumpington Street, Cambridge, CB2 1PZ, UK
11. Gupta S, Patil AT, Purohit M, Parmar M, Patel M, Patil HA, Guido RC. Residual Neural Network precisely quantifies dysarthria severity-level based on short-duration speech segments. Neural Netw 2021;139:105-117. PMID: 33684609. DOI: 10.1016/j.neunet.2021.02.008.
Abstract
Recently, deep learning methodologies have gained significant attention for severity-based classification of dysarthric speech. Detecting dysarthria and quantifying its severity are of paramount importance in various real-life applications, such as assessing patients' progression in treatment, which includes adequate planning of their therapy, and improving speech-based interactive systems to handle pathologically affected voices automatically. Notably, current speech-powered tools often deal with short-duration speech segments and are consequently less efficient at handling impaired speech, even when using convolutional neural networks (CNNs). Thus, detecting dysarthria severity level from short speech segments might help improve the performance and applicability of those systems. To achieve this goal, we propose a novel Residual Network (ResNet)-based technique that receives short-duration speech segments as input. Statistically meaningful objective analysis of our experiments, reported on the standard Universal Access corpus, shows average improvements of 21.35% and 22.48% over the baseline CNN in classification accuracy and F1-score, respectively. For additional comparison, tests with Gaussian mixture models and light CNNs were also performed. Overall, the proposed ResNet approach achieved 98.90% classification accuracy and a 98.00% F1-score, confirming its efficacy and practical applicability.
Affiliations
- Siddhant Gupta, Ankur T Patil, Mirali Purohit, Maitreya Patel, Hemant A Patil: Speech Research Lab, Dhirubhai Ambani Institute of Information and Communication Technology (DA-IICT), Gandhinagar 382007, India
- Rodrigo Capobianco Guido: Instituto de Biociências, Letras e Ciências Exatas, Unesp - Univ Estadual Paulista (São Paulo State University), Rua Cristóvão Colombo 2265, Jd Nazareth, 15054-000, São José do Rio Preto - SP, Brazil
12. Pannala V, Yegnanarayana B. A neural network approach for speech activity detection for Apollo corpus. Comput Speech Lang 2021. DOI: 10.1016/j.csl.2020.101137.
13. Dubey SR, Chakraborty S, Roy SK, Mukherjee S, Singh SK, Chaudhuri BB. diffGrad: An Optimization Method for Convolutional Neural Networks. IEEE Trans Neural Netw Learn Syst 2020;31:4500-4511. PMID: 31880565. DOI: 10.1109/tnnls.2019.2955777.
Abstract
Stochastic gradient descent (SGD) is one of the core techniques behind the success of deep neural networks. The gradient provides information on the direction in which a function has the steepest rate of change. The main problem with basic SGD is that it changes all parameters by equal-sized steps, irrespective of gradient behavior. Hence, an efficient way to optimize deep networks is to give each parameter an adaptive step size. Recently, several attempts have been made to improve gradient descent, such as AdaGrad, AdaDelta, RMSProp, and adaptive moment estimation (Adam). These methods rely on the square roots of exponential moving averages of squared past gradients and thus do not take advantage of local changes in gradients. In this article, a novel optimizer is proposed based on the difference between the present and the immediately past gradient (i.e., diffGrad). In the proposed diffGrad optimization technique, the step size is adjusted for each parameter so that parameters with rapidly changing gradients take larger steps and parameters with slowly changing gradients take smaller ones. The convergence analysis is done using the regret-bound approach of the online-learning framework. A thorough analysis is performed on three synthetic, complex, non-convex functions, and image-categorization experiments are conducted on the CIFAR10 and CIFAR100 datasets to compare diffGrad with state-of-the-art optimizers such as SGDM, AdaGrad, AdaDelta, RMSProp, AMSGrad, and Adam, using a residual-unit (ResNet) based convolutional neural network (CNN) architecture. The experiments show that diffGrad outperforms the other optimizers and performs uniformly well when training CNNs with different activation functions. The source code is publicly available at https://github.com/shivram1987/diffGrad.
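The update rule described in the abstract can be sketched in a few lines of NumPy: an Adam-style step additionally damped by a "friction" coefficient computed from the difference between the present and immediately past gradient (a sigmoid of its absolute value, following the published method). Treat this as an illustrative sketch, not the reference implementation linked above.

```python
import numpy as np

def diffgrad_step(theta, grad, prev_grad, m, v, t,
                  lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One diffGrad update. m and v are Adam's first/second moment
    estimates; the friction coefficient xi is near 1 where the gradient
    changes fast (near-Adam step) and near 0.5 where it changes slowly
    (damped step)."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)                 # bias-corrected moments
    v_hat = v / (1 - beta2 ** t)
    xi = 1.0 / (1.0 + np.exp(-np.abs(grad - prev_grad)))  # in (0.5, 1)
    theta = theta - lr * xi * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Demo: minimise f(x) = x^2 for a few hundred steps
theta = np.array([1.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
prev_grad = np.zeros_like(theta)
for t in range(1, 201):
    grad = 2 * theta                             # gradient of x^2
    theta, m, v = diffgrad_step(theta, grad, prev_grad, m, v, t)
    prev_grad = grad
```

On this toy problem the iterate moves steadily toward the minimum at zero; the only difference from Adam is the extra `xi` factor.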
14. Nadeem MW, Goh HG, Ali A, Hussain M, Khan MA, Ponnusamy VA. Bone Age Assessment Empowered with Deep Learning: A Survey, Open Research Challenges and Future Directions. Diagnostics (Basel) 2020;10(10):781. PMID: 33022947. PMCID: PMC7601134. DOI: 10.3390/diagnostics10100781.
Abstract
Deep learning is a highly useful and rapidly proliferating branch of machine learning. Applications such as medical image analysis, medical image processing, text understanding, and speech recognition have adopted deep learning with rather promising results. Both supervised and unsupervised approaches are used to extract and learn features, as well as for multi-level representations in pattern recognition and classification. Hence, prediction, recognition, and diagnosis in various healthcare domains, including the abdomen, lung cancer, brain tumors, and skeletal bone age assessment, have been transformed and significantly improved by deep learning. Considering this wide range of applications, the main aim of this paper is to present a detailed survey of emerging research on deep-learning models for bone age assessment (e.g., segmentation, prediction, and classification). A large number of scientific publications related to bone age assessment using deep learning are explored, studied, and presented in this survey, and the emerging trends of this research domain are analyzed and discussed. Finally, a critical discussion of the limitations of deep-learning models is presented, and open research challenges and future directions in this promising area are included as well.
Affiliations
- Muhammad Waqas Nadeem: Faculty of Information and Communication Technology (FICT), Universiti Tunku Abdul Rahman (UTAR), 31900 Kampar, Perak, Malaysia; Department of Computer Science, Lahore Garrison University, Lahore 54000, Pakistan
- Hock Guan Goh: Faculty of Information and Communication Technology (FICT), Universiti Tunku Abdul Rahman (UTAR), 31900 Kampar, Perak, Malaysia
- Abid Ali: Department of Computer Science, Lahore Garrison University, Lahore 54000, Pakistan
- Muzammil Hussain: Department of Computer Science, School of Systems and Technology, University of Management and Technology, Lahore 54000, Pakistan
- Muhammad Adnan Khan: Department of Computer Science, Lahore Garrison University, Lahore 54000, Pakistan
- Vasaki a/p Ponnusamy: Faculty of Information and Communication Technology (FICT), Universiti Tunku Abdul Rahman (UTAR), 31900 Kampar, Perak, Malaysia
15. Korkmaz Y, Boyacı A. Unsupervised and supervised VAD systems using combination of time and frequency domain features. Biomed Signal Process Control 2020. DOI: 10.1016/j.bspc.2020.102044.
16. Raphael A, Dubinsky Z, Iluz D, Benichou JIC, Netanyahu NS. Deep neural network recognition of shallow water corals in the Gulf of Eilat (Aqaba). Sci Rep 2020;10:12959. PMID: 32737327. PMCID: PMC7395127. DOI: 10.1038/s41598-020-69201-w.
Abstract
We describe the application of deep learning to the recognition of corals in a shallow reef in the Gulf of Eilat, Red Sea. The project applies deep neural network analysis, based on thousands of underwater images, to the automatic recognition of some common species among the 100 species reported in the Eilat coral reefs. This is a challenging task, since corals, even within the same colony, exhibit significant within-species morphological variability related to age, depth, current, light, geographic location, and inter-specific competition. Since deep learning operates on photographic images, the task is further challenged by image quality, distance from the object, angle of view, and light conditions. We produced a dataset of over 5,000 coral images classified into 11 species in the present automated deep-learning classification scheme. We demonstrate the efficiency and reliability of the method compared to painstaking manual classification. Specifically, we show that the method is readily adaptable to additional species, providing an excellent tool for future studies in the region: it would allow real-time monitoring of the detrimental effects of global climate change and anthropogenic impacts on the coral reefs of the Gulf of Eilat and elsewhere, and would help assess the success of various bioremediation efforts.
Affiliations
- Alina Raphael, Zvy Dubinsky, Jennifer I C Benichou: The Mina and Everard Goodman Faculty of Life Sciences, Bar-Ilan University, Ramat-Gan 5290002, Israel
- David Iluz: The Mina and Everard Goodman Faculty of Life Sciences, Bar-Ilan University, Ramat-Gan 5290002, Israel; Department of Environmental Sciences and Agriculture, Beit Berl College, Beit Berl 4490500, Israel
- Nathan S Netanyahu: Department of Computer Science, Bar-Ilan University, Ramat-Gan 5290002, Israel
17. Bittencourt II, Cukurova M, Muldner K, Luckin R, Millán E. Neural Multi-task Learning for Teacher Question Detection in Online Classrooms. Springer; 2020. PMCID: PMC7334151. DOI: 10.1007/978-3-030-52237-7_22.
18. Speech Discrimination in Real-World Group Communication Using Audio-Motion Multimodal Sensing. Sensors 2020;20(10):2948. PMID: 32456031. PMCID: PMC7287755. DOI: 10.3390/s20102948.
Abstract
Speech discrimination, determining whether a participant is speaking at a given moment, is essential in investigating human verbal communication. In dynamic real-world situations where multiple people form groups in the same space, simultaneous speakers make speech discrimination based solely on audio sensing difficult. In this study, we focused on physical activity during speech and hypothesized that combining audio and physical-motion data acquired by wearable sensors can improve speech discrimination. Utterance and physical-activity data of students in a university participatory class were therefore recorded using smartphones worn around their necks. First, we tested the temporal relationship between manually identified utterances and physical motions and confirmed that physical activities in wide frequency ranges co-occurred with utterances. Second, we trained and tested classifiers for each participant and found higher performance with the audio-motion classifier (average accuracy 92.2%) than with either the audio-only (80.4%) or motion-only (87.8%) classifier. Finally, we tested inter-individual classification and again obtained higher performance with the combined audio-motion classifier (83.2%) than with the audio-only (67.7%) or motion-only (71.9%) classifiers. These results show that audio-motion multimodal sensing using widely available smartphones can provide effective utterance discrimination in dynamic group communication.
19. Robust Audio Content Classification Using Hybrid-Based SMD and Entropy-Based VAD. Entropy 2020;22(2):183. PMID: 33285958. PMCID: PMC7516611. DOI: 10.3390/e22020183.
Abstract
A robust approach to audio content classification (ACC) is proposed in this paper, targeting variable noise-level conditions in particular. Speech, music, and background noise (also treated as silence) are usually mixed in a noisy audio signal. Accordingly, we propose a hierarchical ACC approach consisting of three parts: voice activity detection (VAD), speech/music discrimination (SMD), and post-processing. First, entropy-based VAD segments the input signal into noisy audio and noise, even when the noise level varies. One-dimensional subband energy information (1D-SEI) and two-dimensional textural image information (2D-TII) are then combined into a hybrid feature set, and SMD is performed by feeding this hybrid feature set to a support vector machine (SVM) classifier. Finally, rule-based post-processing of the segments smooths the output of the ACC system, so that the noisy audio is classified into noise, speech, and music. Experimental results show that the hierarchical ACC system using the hybrid-feature SMD and entropy-based VAD performs well on three available datasets and is comparable with existing methods even in variable noise-level environments. In addition, tests of the VAD scheme and the hybrid features show that the proposed architecture increases the performance of audio content discrimination.
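A minimal sketch of the entropy-based VAD idea as it is commonly implemented: frames whose normalised spectral entropy is low (energy concentrated in a few bins) are flagged as speech-like, while flat-spectrum noise frames score high. The frame sizes and threshold below are illustrative assumptions, not the paper's values.

```python
import numpy as np

def spectral_entropy_vad(signal, sr, frame_len=0.025, hop=0.010,
                         threshold=0.85):
    """Frame-level VAD from normalised spectral entropy.

    Noise-like frames have a nearly flat spectrum (entropy near 1);
    tonal/voiced frames concentrate energy in few bins (lower entropy).
    Frames whose normalised entropy falls below `threshold` are flagged
    as speech-like. Returns a boolean array, one flag per frame.
    """
    n = int(frame_len * sr)
    h = int(hop * sr)
    flags = []
    for start in range(0, len(signal) - n + 1, h):
        frame = signal[start:start + n] * np.hanning(n)
        spec = np.abs(np.fft.rfft(frame)) ** 2
        p = spec / (spec.sum() + 1e-12)      # spectral probability mass
        entropy = -np.sum(p * np.log2(p + 1e-12))
        entropy /= np.log2(len(p))           # normalise to [0, 1]
        flags.append(entropy < threshold)
    return np.array(flags)

# Demo: a pure tone scores low entropy, white noise scores high
sr = 16000
t = np.arange(sr) / sr
tone = 0.5 * np.sin(2 * np.pi * 220 * t)
rng = np.random.default_rng(0)
noise = 0.5 * rng.standard_normal(sr)
flags_tone = spectral_entropy_vad(tone, sr)
flags_noise = spectral_entropy_vad(noise, sr)
```

In practice the threshold would be adapted to the estimated noise level, which is what makes the entropy cue robust when the noise level varies.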
22. Hwang S, Jin YG, Shin JW. Dual Microphone Voice Activity Detection Based on Reliable Spatial Cues. Sensors 2019;19(14):3056. PMID: 31373308. PMCID: PMC6678508. DOI: 10.3390/s19143056.
Abstract
Two main spatial cues that can be exploited for dual-microphone voice activity detection (VAD) are the interchannel time difference (ITD) and the interchannel level difference (ILD). While both ITD and ILD provide information on the location of audio sources, they may be impaired in different ways by background noise and reverberation and therefore carry complementary information. Conventional approaches utilize statistics from all frequencies with fixed weights, although the information from some time–frequency bins may degrade VAD performance. In this letter, we propose a dual-microphone VAD scheme based on the spatial cues in reliable frequency bins only, considering the sparsity of the speech signal in the time–frequency domain. The reliability of each time–frequency bin is determined by three conditions on signal energy, ILD, and ITD. ITD-based and ILD-based VADs and statistics are evaluated using the information from the selected frequency bins and then combined to produce the final VAD result. Experimental results show that the proposed frequency-selective approach enhances VAD performance in realistic environments.
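The two spatial cues and the reliability selection can be sketched as follows. This is a simplified single-frame illustration using only the energy condition out of the paper's three-condition reliability test, and the energy floor is an assumption.

```python
import numpy as np

def spatial_cues(left, right, sr, n_fft=512, energy_floor_db=-30.0):
    """Per-bin ILD and ITD from one dual-microphone frame, keeping only
    'reliable' bins whose energy clears a floor relative to the peak.
    Positive ITD means the left channel leads the right."""
    X_l = np.fft.rfft(left * np.hanning(len(left)), n_fft)
    X_r = np.fft.rfft(right * np.hanning(len(right)), n_fft)
    power = np.abs(X_l) ** 2 + np.abs(X_r) ** 2
    floor = power.max() * 10 ** (energy_floor_db / 10)
    reliable = power > floor
    # Interchannel level difference (dB) per bin
    ild = 20 * np.log10((np.abs(X_l) + 1e-12) / (np.abs(X_r) + 1e-12))
    # Interchannel phase difference converted to a per-bin delay (seconds)
    freqs = np.fft.rfftfreq(n_fft, 1 / sr)
    phase_diff = np.angle(X_l * np.conj(X_r))
    itd = np.zeros_like(freqs)
    nz = freqs > 0
    itd[nz] = phase_diff[nz] / (2 * np.pi * freqs[nz])
    return ild[reliable], itd[reliable], reliable

# Demo: the right channel is a 500 Hz tone delayed by 2 samples
sr, n = 16000, 512
f0, d = 500.0, 2
t = np.arange(n)
left = np.sin(2 * np.pi * f0 * t / sr)
right = np.sin(2 * np.pi * f0 * (t - d) / sr)
ild, itd, reliable = spatial_cues(left, right, sr)
```

At the reliable bins, the recovered ITD is close to the imposed 2-sample delay (125 microseconds) and the ILD is near 0 dB, since both channels have equal amplitude.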
Affiliation(s)
- Soojoong Hwang
- School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology, 123 Cheomdan-gwagiro, Buk-gu, Gwangju 61005, Korea
- Yu Gwang Jin
- AI Technology Unit, SK Telecom, 100 Eulji-ro, Jung-gu, Seoul 04551, Korea
- Jong Won Shin
- School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology, 123 Cheomdan-gwagiro, Buk-gu, Gwangju 61005, Korea
23
24
Wang D, Chen J. Supervised Speech Separation Based on Deep Learning: An Overview. IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING 2018; 26:1702-1726. [PMID: 31223631 PMCID: PMC6586438 DOI: 10.1109/taslp.2018.2842159] [Citation(s) in RCA: 121] [Impact Index Per Article: 17.3] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/12/2023]
Abstract
Speech separation is the task of separating target speech from background interference. Traditionally, speech separation is studied as a signal processing problem. A more recent approach formulates speech separation as a supervised learning problem, where the discriminative patterns of speech, speakers, and background noise are learned from training data. Over the past decade, many supervised separation algorithms have been put forward. In particular, the recent introduction of deep learning to supervised speech separation has dramatically accelerated progress and boosted separation performance. This paper provides a comprehensive overview of the research on deep learning based supervised speech separation in the last several years. We first introduce the background of speech separation and the formulation of supervised separation. Then, we discuss three main components of supervised separation: learning machines, training targets, and acoustic features. Much of the overview is on separation algorithms where we review monaural methods, including speech enhancement (speech-nonspeech separation), speaker separation (multitalker separation), and speech dereverberation, as well as multimicrophone techniques. The important issue of generalization, unique to supervised learning, is discussed. This overview provides a historical perspective on how advances are made. In addition, we discuss a number of conceptual issues, including what constitutes the target source.
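One of the training targets surveyed in this overview, the ideal ratio mask (IRM), can be sketched directly. This toy version operates on a single frame and assumes the clean speech and noise signals are available separately, as they are during supervised training; the small floor constant is an assumption to avoid division by zero.

```python
import numpy as np

def ideal_ratio_mask(speech, noise, n_fft=512):
    """Ideal ratio mask: per-bin sqrt of speech energy over total energy.

    A network trained for separation learns to predict this mask from the
    noisy mixture; applying the mask to the mixture spectrum attenuates
    bins dominated by noise while preserving speech-dominated bins.
    """
    s = np.abs(np.fft.rfft(speech, n_fft)) ** 2
    n = np.abs(np.fft.rfft(noise, n_fft)) ** 2
    return np.sqrt(s / (s + n + 1e-12))
```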
Affiliation(s)
- DeLiang Wang
- Department of Computer Science and Engineering and the Center for Cognitive and Brain Sciences, The Ohio State University, Columbus, OH 43210 USA, and also with the Center of Intelligent Acoustics and Immersive Communications, Northwestern Polytechnical University, Xi'an 710072, China
- Jitong Chen
- Department of Computer Science and Engineering, The Ohio State University, Columbus, OH 43210 USA. He is now with Silicon Valley AI Lab, Baidu Research, Sunnyvale, CA 94089 USA
25
Bharti SS, Gupta M, Agarwal S. SVM based Voice Activity Detection by fusing a new acoustic feature PLMS with some existing acoustic features of speech. JOURNAL OF INTELLIGENT & FUZZY SYSTEMS 2018. [DOI: 10.3233/jifs-169692] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/15/2022]
Affiliation(s)
- Shambhu Shankar Bharti
- Department of Computer Science & Engineering, National Institute of Technology Allahabad, U.P., India
- Manish Gupta
- Department of Computer Science & Engineering, National Institute of Technology Allahabad, U.P., India
- Suneeta Agarwal
- Department of Computer Science & Engineering, National Institute of Technology Allahabad, U.P., India
26
Son Y, Lee SB, Kim H, Song ES, Huh H, Czosnyka M, Kim DJ. Automated artifact elimination of physiological signals using a deep belief network: An application for continuously measured arterial blood pressure waveforms. Inf Sci (N Y) 2018. [DOI: 10.1016/j.ins.2018.05.018] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/24/2022]
27
Kim J, Truong KP, Evers V. Automatic temporal ranking of children’s engagement levels using multi-modal cues. COMPUT SPEECH LANG 2018. [DOI: 10.1016/j.csl.2017.12.005] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/18/2022]
28
Zhang YJ, Huang JF, Gong N, Ling ZH, Hu Y. Automatic detection and classification of marmoset vocalizations using deep and recurrent neural networks. THE JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA 2018; 144:478. [PMID: 30075670 DOI: 10.1121/1.5047743] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 02/08/2018] [Accepted: 07/05/2018] [Indexed: 06/08/2023]
Abstract
This paper investigates methods to detect and classify marmoset vocalizations automatically using a large data set of marmoset vocalizations and deep learning techniques. For vocalization detection, neural network-based methods, including a deep neural network (DNN) and a recurrent neural network with long short-term memory units, are designed and compared against a conventional rule-based detection method. For vocalization classification, three different classification algorithms are compared, including a support vector machine (SVM), DNN, and long short-term memory recurrent neural networks (LSTM-RNNs). A 1500-min audio data set containing recordings from four pairs of marmoset twins and manual annotations is employed for the experiments. Two test sets are built according to whether the test samples are produced by the marmosets in the training set (test set I) or not (test set II). Experimental results show that the LSTM-RNN-based detection method outperformed the others and achieved frame error rates of 0.92% and 1.67% on the two test sets. Furthermore, the deep learning models obtained higher classification accuracy than the SVM model, reaching 95.60% and 91.67% on the two test sets, respectively.
Affiliation(s)
- Ya-Jie Zhang
- National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China, 443 Huangshan Road, Hefei 230027, China
- Jun-Feng Huang
- Institute of Neuroscience, State Key Laboratory of Neuroscience, Chinese Academy of Sciences (CAS) Key Laboratory of Primate Neurobiology, Shanghai Institutes for Biological Sciences, CAS, 320 Yueyang Road, Shanghai 200031, China
- Neng Gong
- Institute of Neuroscience, State Key Laboratory of Neuroscience, Chinese Academy of Sciences (CAS) Key Laboratory of Primate Neurobiology, Shanghai Institutes for Biological Sciences, CAS, 320 Yueyang Road, Shanghai 200031, China
- Zhen-Hua Ling
- National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China, 443 Huangshan Road, Hefei 230027, China
- Yu Hu
- National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China, 443 Huangshan Road, Hefei 230027, China
29
Fisher Discriminative Sparse Representation Based on DBN for Fault Diagnosis of Complex System. APPLIED SCIENCES-BASEL 2018. [DOI: 10.3390/app8050795] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/18/2023]
30
Deep Belief Network Based Hybrid Model for Building Energy Consumption Prediction. ENERGIES 2018. [DOI: 10.3390/en11010242] [Citation(s) in RCA: 43] [Impact Index Per Article: 6.1] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/16/2022]
31
Sholokhov A, Sahidullah M, Kinnunen T. Semi-supervised speech activity detection with an application to automatic speaker verification. COMPUT SPEECH LANG 2018. [DOI: 10.1016/j.csl.2017.07.005] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/19/2022]
32
Sehgal A, Kehtarnavaz N. A Convolutional Neural Network Smartphone App for Real-Time Voice Activity Detection. IEEE ACCESS : PRACTICAL INNOVATIONS, OPEN SOLUTIONS 2018; 6:9017-9026. [PMID: 30250774 PMCID: PMC6150492 DOI: 10.1109/access.2018.2800728] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/09/2023]
Abstract
This paper presents a smartphone app that performs real-time voice activity detection based on a convolutional neural network. Real-time implementation issues are discussed, showing how the slow inference time associated with convolutional neural networks is addressed. The developed smartphone app is meant to act as a switch for noise reduction in the signal processing pipelines of hearing devices, enabling noise estimation or classification to be conducted in noise-only parts of noisy speech signals. The developed smartphone app is compared with a previously developed voice activity detection app as well as with two highly cited voice activity detection algorithms. The experimental results indicate that the developed app using a convolutional neural network outperforms the previously developed smartphone app.
Affiliation(s)
- Abhishek Sehgal
- Department of Electrical and Computer Engineering, University of Texas at Dallas, Richardson, TX 75080, USA
- Nasser Kehtarnavaz
- Department of Electrical and Computer Engineering, University of Texas at Dallas, Richardson, TX 75080, USA
33
Park TJ, Chang JH. Dempster-Shafer theory for enhanced statistical model-based voice activity detection. COMPUT SPEECH LANG 2018. [DOI: 10.1016/j.csl.2017.07.001] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/19/2022]
34
Lee S, Chang JH. Deep learning ensemble with asymptotic techniques for oscillometric blood pressure estimation. COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE 2017; 151:1-13. [PMID: 28946991 DOI: 10.1016/j.cmpb.2017.08.005] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 03/15/2017] [Revised: 06/07/2017] [Accepted: 08/07/2017] [Indexed: 06/07/2023]
Abstract
BACKGROUND AND OBJECTIVE This paper proposes a deep learning based ensemble regression estimator with asymptotic techniques, and offers a method that can decrease uncertainty in oscillometric blood pressure (BP) measurements using bootstrap and Monte-Carlo approaches. The former is used to estimate systolic and diastolic BP (SBP and DBP), while the latter determines confidence intervals (CIs) for SBP and DBP based on the oscillometric measurements. METHOD This work employs deep belief network-deep neural network (DBN-DNN) estimators to estimate BP from oscillometric measurements. However, these methods have some inherent problems. First, it is not easy to determine the best DBN-DNN estimator, and valuable information may be lost when one DBN-DNN estimator is selected and the others are discarded. Additionally, our input feature vectors, obtained from only five measurements per subject, represent a very small sample size; this is a critical weakness for the DBN-DNN technique and can cause overfitting or underfitting, depending on the structure of the algorithm. To address these problems, an ensemble with an asymptotic approach (combining the bootstrap with the DBN-DNN technique) is used to generate the pseudo features needed to estimate SBP and DBP. In the first stage, bootstrap aggregation creates the ensemble parameters. The AdaBoost approach is then employed for second-stage SBP and DBP estimation. In the third stage, the bootstrap and Monte-Carlo techniques determine the CIs based on the target BP estimated by the DBN-DNN ensemble regression estimator with the asymptotic technique.
RESULTS The proposed method mitigates estimation uncertainty such as a large standard deviation of error (SDE). Comparing the proposed DBN-DNN ensemble regression estimator with the single DBN-DNN regression estimator, the SDEs of the SBP and DBP estimates are reduced by 0.58 and 0.57 mmHg, respectively, corresponding to performance improvements of 9.18% and 10.88% over the single estimator. CONCLUSION The proposed methodology improves the accuracy of BP estimation and reduces its uncertainty.
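The bootstrap-ensemble-with-CI pipeline described in this abstract can be sketched with a toy estimator standing in for the DBN-DNN regressors. The 1-D linear model, resample count, and percentile CI below are illustrative assumptions; the paper's estimators, AdaBoost stage, and Monte-Carlo CI computation are more elaborate.

```python
import numpy as np

def bootstrap_ensemble_ci(x, y, x_new, n_boot=200, alpha=0.05, seed=0):
    """Bagged 1-D linear regression with a percentile confidence interval.

    Each bootstrap resample fits its own estimator; the bag average is the
    point estimate, and the spread of the bag's predictions gives the CI,
    mirroring how an ensemble can quantify estimation uncertainty.
    """
    rng = np.random.default_rng(seed)
    n = len(x)
    preds = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)        # resample with replacement
        slope, intercept = np.polyfit(x[idx], y[idx], 1)
        preds[b] = slope * x_new + intercept
    lo, hi = np.quantile(preds, [alpha / 2.0, 1.0 - alpha / 2.0])
    return float(preds.mean()), (float(lo), float(hi))
```

The width of the returned interval shrinks as the training data become less noisy, which is the uncertainty-reduction behavior the abstract reports.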
Affiliation(s)
- Soojeong Lee
- School of Electronic Engineering, Hanyang University, 222 Wangsimni-ro, Seongdong, Seoul 133-791, Republic of Korea
- Joon-Hyuk Chang
- School of Electronic Engineering, Hanyang University, 222 Wangsimni-ro, Seongdong, Seoul 133-791, Republic of Korea
35
Saki F, Kehtarnavaz N. Automatic switching between noise classification and speech enhancement for hearing aid devices. ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. ANNUAL INTERNATIONAL CONFERENCE 2017; 2016:736-739. [PMID: 28268433 DOI: 10.1109/embc.2016.7590807] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Subscribe] [Scholar Register] [Indexed: 11/05/2022]
Abstract
This paper presents a voice activity detector (VAD) for automatic switching between a noise classifier and a speech enhancer as part of the signal processing pipeline of hearing aid devices. The developed VAD consists of a computationally efficient feature extractor and a random forest classifier. Previously used signal features as well as two newly introduced signal features are extracted and fed into the classifier to perform automatic switching. This switching approach is compared to two popular VADs. The results obtained indicate the introduced approach outperforms these existing approaches in terms of both detection rate and processing time.
36
37
Xiao B, Imel ZE, Georgiou PG, Atkins DC, Narayanan SS. "Rate My Therapist": Automated Detection of Empathy in Drug and Alcohol Counseling via Speech and Language Processing. PLoS One 2015; 10:e0143055. [PMID: 26630392 PMCID: PMC4668058 DOI: 10.1371/journal.pone.0143055] [Citation(s) in RCA: 54] [Impact Index Per Article: 5.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/04/2015] [Accepted: 10/30/2015] [Indexed: 11/30/2022] Open
Abstract
The technology for evaluating patient-provider interactions in psychotherapy, observational coding, has not changed in 70 years. It is labor-intensive, error-prone, and expensive, limiting its use in evaluating psychotherapy in the real world. Engineering solutions from speech and language processing provide new methods for the automatic evaluation of provider ratings from session recordings. The primary data are 200 Motivational Interviewing (MI) sessions from a study on MI training methods with observer ratings of counselor empathy. Automatic Speech Recognition (ASR) was used to transcribe sessions, and the resulting words were used in a text-based predictive model of empathy. Two supporting datasets trained the speech processing tasks, including ASR (1200 transcripts from heterogeneous psychotherapy sessions and 153 transcripts and session recordings from 5 MI clinical trials). The accuracy of computationally derived empathy ratings was evaluated against human ratings for each provider. Computationally derived empathy scores and classifications (high vs. low) were highly accurate against human-based codes and classifications, with a correlation of 0.65 and an F-score (a weighted average of sensitivity and specificity) of 0.86, respectively. Empathy prediction using human transcription as input (as opposed to ASR) resulted in a slight increase in prediction accuracy, suggesting that the fully automatic system with ASR is relatively robust. Using speech and language processing methods, it is possible to generate accurate predictions of provider performance in psychotherapy from audio recordings alone. This technology can support large-scale evaluation of psychotherapy for dissemination and process studies.
Affiliation(s)
- Bo Xiao
- Department of Electrical Engineering, University of Southern California, Los Angeles, United States of America
- Zac E. Imel
- Department of Educational Psychology, University of Utah, Salt Lake City, United States of America
- Panayiotis G. Georgiou
- Department of Electrical Engineering, University of Southern California, Los Angeles, United States of America
- David C. Atkins
- Department of Psychiatry and Behavioral Sciences, University of Washington, Seattle, United States of America
- Shrikanth S. Narayanan
- Department of Electrical Engineering, University of Southern California, Los Angeles, United States of America
38
Carlin MA, Elhilali M. A Framework for Speech Activity Detection Using Adaptive Auditory Receptive Fields. IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING 2015; 23:2422-2433. [PMID: 29904642 PMCID: PMC5997283 DOI: 10.1109/taslp.2015.2481179] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/08/2023]
Abstract
One of the hallmarks of sound processing in the brain is the ability of the nervous system to adapt to changing behavioral demands and surrounding soundscapes. It can dynamically shift sensory and cognitive resources to focus on relevant sounds. Neurophysiological studies indicate that this ability is supported by adaptively retuning the shapes of cortical spectro-temporal receptive fields (STRFs) to enhance features of target sounds while suppressing those of task-irrelevant distractors. Because an important component of human communication is the ability of a listener to dynamically track speech in noisy environments, the solution obtained by auditory neurophysiology implies a useful adaptation strategy for speech activity detection (SAD). SAD is an important first step in a number of automated speech processing systems, and performance is often reduced in highly noisy environments. In this paper, we describe how task-driven adaptation is induced in an ensemble of neurophysiological STRFs, and show how speech-adapted STRFs reorient themselves to enhance spectro-temporal modulations of speech while suppressing those associated with a variety of nonspeech sounds. We then show how an adapted ensemble of STRFs can better detect speech in unseen noisy environments compared to an unadapted ensemble and a noise-robust baseline. Finally, we use a stimulus reconstruction task to demonstrate how the adapted STRF ensemble better captures the spectrotemporal modulations of attended speech in clean and noisy conditions. Our results suggest that a biologically plausible adaptation framework can be applied to speech processing systems to dynamically adapt feature representations for improving noise robustness.
Affiliation(s)
- Michael A Carlin
- Department of Electrical and Computer Engineering, Johns Hopkins University, Baltimore, MD 21218 USA
- Mounya Elhilali
- Department of Electrical and Computer Engineering, Johns Hopkins University, Baltimore, MD 21218 USA
39
Shen F, Chao J, Zhao J. Forecasting exchange rate using deep belief networks and conjugate gradient method. Neurocomputing 2015. [DOI: 10.1016/j.neucom.2015.04.071] [Citation(s) in RCA: 105] [Impact Index Per Article: 10.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/23/2022]