1. Dey S, Sahidullah M, Saha G. Cross-corpora spoken language identification with domain diversification and generalization. Comput Speech Lang 2023. DOI: 10.1016/j.csl.2023.101489
2. Zong Y, Lian H, Zhang J, Feng E, Lu C, Chang H, Tang C. Progressive distribution adapted neural networks for cross-corpus speech emotion recognition. Front Neurorobot 2022; 16:987146. PMID: 36187564. PMCID: PMC9520908. DOI: 10.3389/fnbot.2022.987146
Abstract
In this paper, we investigate a challenging but interesting task in speech emotion recognition (SER) research: cross-corpus SER. Unlike conventional SER, the training (source) and testing (target) samples in cross-corpus SER come from different speech corpora, which results in a feature distribution mismatch between them; consequently, the performance of most existing SER methods may drop sharply. To cope with this problem, we propose a simple yet effective deep transfer learning method called progressive distribution adapted neural networks (PDAN). PDAN employs a convolutional neural network (CNN) as the backbone and the speech spectrum as input, yielding an end-to-end learning framework. More importantly, its basic idea for solving cross-corpus SER is straightforward: enhance the backbone's corpus-invariant feature learning ability by incorporating a progressive distribution adapted regularization term into the original loss function to guide network training. To evaluate the proposed PDAN, extensive cross-corpus SER experiments were conducted on the EmoDB, eNTERFACE, and CASIA speech emotion corpora. Experimental results showed that PDAN outperforms most well-performing deep and subspace transfer learning methods on cross-corpus SER tasks.
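The abstract does not give PDAN's exact regularizer or schedule, but the general recipe it describes (a distribution-adaptation penalty whose weight is ramped up progressively during training) can be sketched as follows. The ramp schedule 2/(1 + e^(-10p)) - 1 is an assumption borrowed from common domain-adaptation practice, and the function names are illustrative, not taken from the paper.

```python
import math

def progressive_weight(epoch, num_epochs):
    """Ramp the adaptation weight from 0 toward 1 as training progresses.

    The schedule 2 / (1 + exp(-10 * p)) - 1 is a common choice in
    domain-adaptation work; the actual PDAN schedule may differ.
    """
    p = epoch / float(num_epochs)
    return 2.0 / (1.0 + math.exp(-10.0 * p)) - 1.0

def pdan_style_loss(ce_loss, adaptation_loss, epoch, num_epochs):
    """Classification loss plus a progressively weighted
    distribution-adaptation regularization term (hypothetical composition)."""
    return ce_loss + progressive_weight(epoch, num_epochs) * adaptation_loss
```

Early in training the adaptation term contributes almost nothing, letting the backbone first learn discriminative emotion features; its influence then grows as the representation stabilizes.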
Affiliation(s)
- Yuan Zong (corresponding author): Key Laboratory of Child Development and Learning Science of Ministry of Education, Southeast University, Nanjing, China; School of Biological Science and Medical Engineering, Southeast University, Nanjing, China
- Hailun Lian: Key Laboratory of Child Development and Learning Science of Ministry of Education, Southeast University, Nanjing, China
- Jiacheng Zhang: Key Laboratory of Child Development and Learning Science of Ministry of Education, Southeast University, Nanjing, China; School of Cyber Science and Engineering, Southeast University, Nanjing, China
- Ercui Feng: Affiliated Jiangning Hospital, Nanjing Medical University, Nanjing, China
- Cheng Lu: Key Laboratory of Child Development and Learning Science of Ministry of Education, Southeast University, Nanjing, China
- Hongli Chang: Key Laboratory of Child Development and Learning Science of Ministry of Education, Southeast University, Nanjing, China
- Chuangao Tang: Key Laboratory of Child Development and Learning Science of Ministry of Education, Southeast University, Nanjing, China; School of Biological Science and Medical Engineering, Southeast University, Nanjing, China
3. Zong Y, Lian H, Chang H, Lu C, Tang C. Adapting multiple distributions for bridging emotions from different speech corpora. Entropy (Basel) 2022; 24:1250. PMID: 36141136. PMCID: PMC9497589. DOI: 10.3390/e24091250
Abstract
In this paper, we focus on a challenging but interesting task in speech emotion recognition (SER): cross-corpus SER. Unlike conventional SER, a feature distribution mismatch may exist between the labeled source (training) and target (testing) speech samples in cross-corpus SER because they come from different speech emotion corpora, which degrades the performance of most well-performing SER methods. To address this issue, we propose a novel transfer subspace learning method called multiple distribution-adapted regression (MDAR) to bridge the gap between speech samples from different corpora. Specifically, MDAR learns a projection matrix that relates the source speech features to their emotion labels. A novel regularization term called multiple distribution adaption (MDA), consisting of one marginal and two conditional distribution-adapted operations, is designed so that this discriminative projection matrix remains applicable to the target speech samples regardless of corpus variance. Consequently, the learned projection matrix lets us predict the emotion labels of target speech samples when only the source label information is given. To evaluate the proposed MDAR method, extensive cross-corpus SER tasks were designed on three speech emotion corpora: EmoDB, eNTERFACE, and CASIA. Experimental results showed that MDAR outperformed most recent state-of-the-art transfer subspace learning methods, and even performed better than several well-performing deep transfer learning methods, on cross-corpus SER tasks.
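The abstract does not spell out MDA's formulation. As a hedged sketch of the kind of marginal distribution-adapted term such regularizers typically use: with a linear kernel, the squared maximum mean discrepancy (MMD) between source and target samples reduces to the squared Euclidean distance between their mean feature vectors. Function names here are illustrative, not from the paper.

```python
def feature_mean(samples):
    """Mean feature vector of a list of equal-length feature vectors."""
    n, d = len(samples), len(samples[0])
    return [sum(x[i] for x in samples) / n for i in range(d)]

def marginal_mmd_sq(source, target):
    """Squared linear-kernel MMD between two sample sets: the squared
    distance between their mean feature vectors."""
    mu_s, mu_t = feature_mean(source), feature_mean(target)
    return sum((a - b) ** 2 for a, b in zip(mu_s, mu_t))
```

A conditional counterpart would apply the same statistic per emotion class, using predicted labels for the unlabeled target samples; minimizing such terms alongside the regression loss encourages the projection to be corpus-invariant.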
Affiliation(s)
- Yuan Zong: Key Laboratory of Child Development and Learning Science of Ministry of Education, Southeast University, Nanjing 210096, China; School of Biological Science and Medical Engineering, Southeast University, Nanjing 210096, China
- Hailun Lian: Key Laboratory of Child Development and Learning Science of Ministry of Education, Southeast University, Nanjing 210096, China; School of Information Science and Engineering, Southeast University, Nanjing 210096, China
- Hongli Chang: Key Laboratory of Child Development and Learning Science of Ministry of Education, Southeast University, Nanjing 210096, China; School of Information Science and Engineering, Southeast University, Nanjing 210096, China
- Cheng Lu: Key Laboratory of Child Development and Learning Science of Ministry of Education, Southeast University, Nanjing 210096, China; School of Information Science and Engineering, Southeast University, Nanjing 210096, China
- Chuangao Tang: Key Laboratory of Child Development and Learning Science of Ministry of Education, Southeast University, Nanjing 210096, China; School of Biological Science and Medical Engineering, Southeast University, Nanjing 210096, China
4. Wen G, Liao H, Li H, Wen P, Zhang T, Gao S, Wang B. Self-labeling with feature transfer for speech emotion recognition. Knowl Based Syst 2022. DOI: 10.1016/j.knosys.2022.109589
5. Zhang S, Liu R, Tao X, Zhao X. Deep cross-corpus speech emotion recognition: recent advances and perspectives. Front Neurorobot 2021; 15:784514. PMID: 34912204. PMCID: PMC8666588. DOI: 10.3389/fnbot.2021.784514
Abstract
Automatic speech emotion recognition (SER) is a challenging component of human-computer interaction (HCI). The existing literature mainly evaluates SER performance by training and testing on a single corpus in a single language. In many practical applications, however, the training and testing corpora differ greatly, and because of this diversity across speech emotion corpora and languages, most previous SER methods do not perform well in real-world cross-corpus or cross-language scenarios. Inspired by the powerful feature learning ability of recently emerged deep learning techniques, various advanced deep learning models have increasingly been adopted for cross-corpus SER. This paper provides an up-to-date and comprehensive survey of cross-corpus SER, covering the supervised, unsupervised, and semi-supervised deep learning techniques used in this area. It also highlights the challenges and opportunities of cross-corpus SER tasks and outlines future trends.
Affiliation(s)
- Shiqing Zhang: Institute of Intelligence Information Processing, Taizhou University, Zhejiang, China
- Ruixin Liu: Institute of Intelligence Information Processing, Taizhou University, Zhejiang, China; School of Sugon Big Data Science, Zhejiang University of Science and Technology, Zhejiang, China
- Xin Tao: Institute of Intelligence Information Processing, Taizhou University, Zhejiang, China
- Xiaoming Zhao: Institute of Intelligence Information Processing, Taizhou University, Zhejiang, China
6. Gideon J, McInnis MG, Provost EM. Improving cross-corpus speech emotion recognition with adversarial discriminative domain generalization (ADDoG). IEEE Trans Affect Comput 2021; 12:1055-1068. PMID: 35695825. PMCID: PMC9173710. DOI: 10.1109/taffc.2019.2916092
Abstract
Automatic speech emotion recognition provides computers with critical context for understanding users. While methods trained and tested within a single dataset have proven successful, they often fail when applied to unseen datasets. Recent work has therefore focused on adversarial methods for learning more generalized representations of emotional speech, but many of these methods have trouble converging and involve only datasets collected under laboratory conditions. In this paper, we introduce adversarial discriminative domain generalization (ADDoG), which follows an easier-to-train "meet in the middle" approach: the model iteratively moves the representations learned for each dataset closer to one another, improving cross-dataset generalization. We also introduce multiclass ADDoG (MADDoG), which extends the proposed method to more than two datasets simultaneously. Our results show consistent convergence for the introduced methods, with significantly improved results when labels from the target dataset are not used. We also show how, in most cases, ADDoG and MADDoG improve upon baseline state-of-the-art methods when target dataset labels are added and in-the-wild data are considered. Although our experiments focus on cross-corpus speech emotion, these methods could be used to remove unwanted factors of variation in other settings.
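The "meet in the middle" behavior, iteratively pulling each dataset's learned representation toward the other, can be caricatured with a one-dimensional toy. This illustrates only the convergence dynamics, not the adversarial critic/generator training that ADDoG actually uses; names and step size are invented for the sketch.

```python
def meet_in_the_middle(rep_a, rep_b, steps, step_size=0.5):
    """Toy sketch: each iteration nudges both (scalar) dataset
    representations toward their midpoint, halving the cross-dataset
    gap per step while leaving their common center unchanged."""
    for _ in range(steps):
        midpoint = (rep_a + rep_b) / 2.0
        rep_a += step_size * (midpoint - rep_a)
        rep_b += step_size * (midpoint - rep_b)
    return rep_a, rep_b
```

The point of the symmetric update is that neither dataset's representation is forced all the way onto the other's; both move, which the paper argues is easier to train than one-sided adversarial alignment.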
7. Liu Z, Rehman A, Wu M, Cao W, Hao M. Speech emotion recognition based on formant characteristics feature extraction and phoneme type convergence. Inf Sci (N Y) 2021; 563:309-325. DOI: 10.1016/j.ins.2021.02.016
8. Xiao Y, Zhao H, Li T. Learning class-aligned and generalized domain-invariant representations for speech emotion recognition. IEEE Trans Emerg Top Comput Intell 2020. DOI: 10.1109/tetci.2020.2972926