1
Li M, Erickson IM. It's Not Only What You Say, But Also How You Say It: Machine Learning Approach to Estimate Trust from Conversation. Human Factors 2024; 66:1724-1741. [PMID: 37116009] [PMCID: PMC11044523] [DOI: 10.1177/00187208231166624]
Abstract
OBJECTIVE The objective of this study was to estimate trust from conversations using both lexical and acoustic data.
BACKGROUND As NASA moves to long-duration space exploration operations, the increasing need for cooperation between humans and virtual agents requires real-time trust estimation by virtual agents. Measuring trust through conversation is a novel and unintrusive approach.
METHOD A 2 (reliability) × 2 (cycles) × 3 (events) within-subject study with habitat system maintenance was designed to elicit various levels of trust in a conversational agent. Participants had trust-related conversations with the conversational agent at the end of each decision-making task. To estimate trust, subjective trust ratings were predicted using machine learning models trained on three types of conversational features (lexical, acoustic, and combined). After training, model explanation was performed using variable importance and partial dependence plots.
RESULTS A random forest algorithm trained on the combined lexical and acoustic features predicted trust in the conversational agent most accurately (adjusted R² = 0.71). The most important predictors were a combination of lexical and acoustic cues: average sentiment considering valence shifters, the mean of formants, and Mel-frequency cepstral coefficients (MFCC). These conversational features were identified as partial mediators predicting people's trust.
CONCLUSION Precise trust estimation from conversation requires both lexical and acoustic cues.
APPLICATION These results show the possibility of using conversational data to measure trust, and potentially other dynamic mental states, unobtrusively and dynamically.
Affiliation(s)
- Mengyao Li
- Department of Industrial and Systems Engineering, University of Wisconsin-Madison, Madison, Wisconsin, USA
- Isabel M Erickson
- Department of Industrial and Systems Engineering, University of Wisconsin-Madison, Madison, Wisconsin, USA
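A minimal sketch of the modelling step described above: a random forest regressor over combined lexical and acoustic features, scored with adjusted R². The feature names and synthetic data are illustrative assumptions, not the authors' materials.

```python
# Hedged sketch: predict subjective trust ratings from combined lexical +
# acoustic conversational features with a random forest. All data below is
# synthetic; feature names mirror the cues named in the abstract.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 300
sentiment = rng.uniform(-1, 1, n)        # lexical: average sentiment with valence shifters
formant_mean = rng.normal(500, 50, n)    # acoustic: mean formant frequency (Hz)
mfcc_means = rng.normal(0, 1, (n, 13))   # acoustic: per-conversation MFCC means
X = np.column_stack([sentiment, formant_mean, mfcc_means])
# Synthetic trust ratings driven by the cues plus noise
y = 3 + 2 * sentiment + 0.01 * (formant_mean - 500) + mfcc_means[:, 0] + rng.normal(0, 0.5, n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)

r2 = r2_score(y_te, model.predict(X_te))
n_te, p = X_te.shape
r2_adj = 1 - (1 - r2) * (n_te - 1) / (n_te - p - 1)   # adjusted R^2
print(f"adjusted R^2 = {r2_adj:.2f}")
print("top importances:", np.argsort(model.feature_importances_)[::-1][:3])
```

Variable importance here plays the role of the paper's model-explanation step; partial dependence plots could be added with sklearn.inspection.PartialDependenceDisplay.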
2
Humayun MA, Shuja J, Abas PE. A review of social background profiling of speakers from speech accents. PeerJ Comput Sci 2024; 10:e1984. [PMID: 38660189] [PMCID: PMC11042007] [DOI: 10.7717/peerj-cs.1984]
Abstract
Social background profiling of speakers is heavily used in areas such as speech forensics and in tuning speech recognition systems for improved accuracy. This article surveys recent research on speaker background profiling in terms of accent classification, describing the datasets, speech features, and classification models used for these tasks and comparing the performance measures achieved by the different methods. This analysis provides insights into the strengths and weaknesses of the different approaches to accent classification. Research gaps are then identified, serving as a useful resource for researchers looking to advance the field.
Affiliation(s)
- Mohammad Ali Humayun
- Department of Computer Science, Information Technology University, Lahore, Pakistan
- Junaid Shuja
- Department of Computer and Information Sciences, Universiti Teknologi PETRONAS, Seri Iskandar, Malaysia
3
Deng M, Chen J, Wu Y, Ma S, Li H, Yang Z, Shen Y. Using voice recognition to measure trust during interactions with automated vehicles. Applied Ergonomics 2024; 116:104184. [PMID: 38048717] [DOI: 10.1016/j.apergo.2023.104184]
Abstract
Trust in automated vehicle systems (AVs) can affect the experience and safety of drivers and passengers. This work investigates the use of speech to measure drivers' trust in AVs. Seventy-five participants were randomly assigned to a high-trust group (an AV with 100% correctness, no crashes, and four system messages with visual-auditory take-over requests, TORs) or a low-trust group (an AV with 60% correctness, a 40% crash rate, and two system messages with visual-only TORs). Voice interaction tasks were used to collect speech during the driving process. The results revealed that these settings successfully induced trust and distrust states. The speech features extracted from the two trust groups were used to train a back-propagation neural network, which was evaluated on its ability to predict the trust classification accurately. The highest classification accuracy was 90.80%. This study proposes a method for accurately measuring trust in automated vehicles using voice recognition.
Affiliation(s)
- Miaomiao Deng
- Department of Psychology, Zhejiang Sci-Tech University, Hangzhou, China
- Jiaqi Chen
- Department of Psychology, Zhejiang Sci-Tech University, Hangzhou, China
- Yue Wu
- Department of Psychology, Zhejiang Sci-Tech University, Hangzhou, China
- Shu Ma
- Department of Psychology, Zhejiang Sci-Tech University, Hangzhou, China
- Hongting Li
- Institute of Applied Psychology, College of Education, Zhejiang University of Technology, Hangzhou, China
- Zhen Yang
- Department of Psychology, Zhejiang Sci-Tech University, Hangzhou, China
- Yi Shen
- Department of Mathematics, Zhejiang Sci-Tech University, Hangzhou, China
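As a rough illustration of the classification step, and not the authors' exact network or feature set, the sketch below trains a back-propagation neural network (scikit-learn's MLPClassifier) on synthetic speech-feature vectors with binary high/low-trust labels.

```python
# Hedged sketch: binary trust classification from speech features with a
# back-propagation neural network. Features and labels are synthetic.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 400
X = rng.normal(0, 1, (n, 20))   # stand-in for extracted speech features
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.5, n) > 0).astype(int)  # 1 = high trust

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)
scaler = StandardScaler().fit(X_tr)
clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=1)
clf.fit(scaler.transform(X_tr), y_tr)
print(f"trust classification accuracy: {clf.score(scaler.transform(X_te), y_te):.2%}")
```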
4
Pulatov I, Oteniyazov R, Makhmudov F, Cho YI. Enhancing Speech Emotion Recognition Using Dual Feature Extraction Encoders. Sensors (Basel) 2023; 23:6640. [PMID: 37514933] [PMCID: PMC10383041] [DOI: 10.3390/s23146640]
Abstract
Understanding and identifying emotional cues in human speech is a crucial aspect of human-computer communication, and extracting relevant emotional characteristics from speech forms a significant part of this process. The objective of this study was to design a framework for speech emotion recognition based on spectrograms and semantic feature transcribers, addressing notable shortcomings of existing methods to improve recognition accuracy. Two strategies were used to obtain useful attributes for emotion detection. First, a fully convolutional neural network model was used to encode speech spectrograms. Second, a Mel-frequency cepstral coefficient feature extraction approach was adopted and integrated with Speech2Vec for semantic feature encoding. The two sets of features were processed separately before being fed into a long short-term memory network and a fully connected layer for further representation, with the aim of improving the model's ability to accurately recognize and interpret emotion from human speech. The proposed system was evaluated on two databases, RAVDESS and EMO-DB, and outperformed established models, achieving an accuracy of 94.8% on the RAVDESS dataset and 94.0% on the EMO-DB dataset.
Affiliation(s)
- Ilkhomjon Pulatov
- Department of Computer Engineering, Gachon University, Seongnam 13120, Republic of Korea
- Rashid Oteniyazov
- Department of Telecommunication Engineering, Nukus Branch of Tashkent University of Information Technologies Named after Muhammad Al-Khwarizmi, Nukus 230100, Uzbekistan
- Fazliddin Makhmudov
- Department of Computer Engineering, Gachon University, Seongnam 13120, Republic of Korea
- Young-Im Cho
- Department of Computer Engineering, Gachon University, Seongnam 13120, Republic of Korea
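The dual-branch idea can be sketched as below. Layer sizes, the 39-dimensional MFCC input, and the eight output classes are assumptions for illustration; the Speech2Vec semantic branch is approximated here by a plain recurrent encoder over MFCC frames rather than the pretrained embedding the paper uses.

```python
# Hedged PyTorch sketch of a dual-encoder SER model: a CNN branch over
# spectrograms and a recurrent branch over MFCC-based frame features,
# fused and classified by a fully connected layer.
import torch
import torch.nn as nn

class DualEncoderSER(nn.Module):
    def __init__(self, n_mfcc=39, n_classes=8):
        super().__init__()
        # Branch 1: convolutional encoder over (1, freq, time) spectrograms
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, 1)), nn.Flatten(),
        )
        # Branch 2: recurrent encoder over per-frame MFCC sequences
        self.lstm = nn.LSTM(n_mfcc, 64, batch_first=True)
        self.fc = nn.Linear(32 + 64, n_classes)

    def forward(self, spec, mfcc_seq):
        a = self.cnn(spec)                # (batch, 32)
        _, (h, _) = self.lstm(mfcc_seq)   # h: (1, batch, 64)
        return self.fc(torch.cat([a, h[-1]], dim=1))

model = DualEncoderSER()
spec = torch.randn(4, 1, 128, 200)        # fake batch of spectrograms
mfcc = torch.randn(4, 200, 39)            # fake per-frame MFCC sequences
print(model(spec, mfcc).shape)            # torch.Size([4, 8])
```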
5
Kumar MR, Vekkot S, Lalitha S, Gupta D, Govindraj VJ, Shaukat K, Alotaibi YA, Zakariah M. Dementia Detection from Speech Using Machine Learning and Deep Learning Architectures. Sensors (Basel) 2022; 22:9311. [PMID: 36502013] [PMCID: PMC9740675] [DOI: 10.3390/s22239311]
Abstract
Dementia affects the patient's memory and leads to language impairment. Research has demonstrated that speech and language deterioration is often a clear indication of dementia and plays a crucial role in the recognition process. Even though earlier studies have used speech features to recognize subjects suffering from dementia, they are often used along with other linguistic features obtained from transcriptions. This study explores significant standalone speech features for recognizing dementia. The primary contribution of this work is to identify a compact set of speech features that aid the dementia recognition process. The secondary contribution is to leverage machine learning (ML) and deep learning (DL) models for the recognition task. Speech samples from the Pitt corpus in DementiaBank are utilized for the present study. A critical speech feature set of prosodic, voice-quality, and cepstral features is proposed for the task. The experimental results demonstrate the superiority of machine learning (87.6%) over deep learning (85%) models for recognizing dementia using the compact speech feature combination, along with lower time and memory consumption. The results obtained using the proposed approach are promising compared with existing works on dementia recognition from speech.
Affiliation(s)
- M. Rupesh Kumar
- Department of Electronics & Communication Engineering, Amrita School of Engineering, Amrita Vishwa Vidyapeetham, Bengaluru 560035, India
- Susmitha Vekkot
- Department of Electronics & Communication Engineering, Amrita School of Engineering, Amrita Vishwa Vidyapeetham, Bengaluru 560035, India
- S. Lalitha
- Department of Electronics & Communication Engineering, Amrita School of Engineering, Amrita Vishwa Vidyapeetham, Bengaluru 560035, India
- Deepa Gupta
- Department of Computer Science & Engineering, Amrita School of Computing, Amrita Vishwa Vidyapeetham, Bengaluru 560035, India
- Varasiddhi Jayasuryaa Govindraj
- Department of Electronics & Communication Engineering, Amrita School of Engineering, Amrita Vishwa Vidyapeetham, Bengaluru 560035, India
- Kamran Shaukat
- School of Information and Physical Sciences, The University of Newcastle, Newcastle 2300, Australia
- Yousef Ajami Alotaibi
- Department of Computer Engineering, College of Computer and Information Sciences, King Saud University, Riyadh 11451, Saudi Arabia
- Mohammed Zakariah
- Department of Computer Engineering, College of Computer and Information Sciences, King Saud University, Riyadh 11451, Saudi Arabia
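A hedged sketch of the kind of compact feature extraction the study describes, combining prosodic, voice-quality, and cepstral cues. The librosa calls are standard, but the exact feature set and the file name are assumptions, not the paper's recipe.

```python
# Sketch: reduce one utterance to a fixed-length vector of prosodic,
# voice-quality, and cepstral summaries, ready for an ML classifier.
import numpy as np
import librosa

def compact_features(path):
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # cepstral
    f0 = librosa.yin(y, fmin=50, fmax=400, sr=sr)        # prosodic: pitch track
    rms = librosa.feature.rms(y=y)                       # prosodic: energy
    zcr = librosa.feature.zero_crossing_rate(y)          # voice-quality proxy
    return np.concatenate([
        mfcc.mean(axis=1), mfcc.std(axis=1),
        [np.nanmean(f0), np.nanstd(f0), rms.mean(), zcr.mean()],
    ])

# vec = compact_features("utterance.wav")  # hypothetical file; then fit an SVM/RF
```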
6
Early recognition of a caller's emotion in out-of-hospital cardiac arrest dispatching: An artificial intelligence approach. Resuscitation 2021; 167:144-150. [PMID: 34461203] [DOI: 10.1016/j.resuscitation.2021.08.032]
Abstract
AIM This study aimed to develop an AI model for detecting a caller's emotional state during out-of-hospital cardiac arrest calls by processing audio recordings of dispatch communications.
METHODS Audio recordings of 337 out-of-hospital cardiac arrest calls from March-April 2011 were retrieved. The callers' emotional state was classified based on emotional content and cooperative scores. Mel-frequency cepstral coefficients were used to extract essential information from the voice signals. A support vector machine was utilised for the automatic judgement, and repeated random sub-sampling cross-validation (RRS-CV) was applied to evaluate robustness. The results from the artificial intelligence classifier were compared with the consensus of expert reviewers.
RESULTS The audio recordings were classified into five emotional content and cooperative score levels. The proposed model had an average positive predictive value of 72.97%, a negative predictive value of 93.47%, a sensitivity of 38.76%, and a specificity of 98.29%. If only the first 10 seconds of the recordings were considered, it had an average positive predictive value of 84.62%, a negative predictive value of 93.57%, a sensitivity of 52.38%, and a specificity of 98.64%. The model maintained favourable performance for emotionally stable cases.
CONCLUSION Artificial intelligence models can facilitate the judgement of callers' emotional states during dispatch conversations. This model has the potential to be utilised in practice by pre-screening emotionally stable callers, allowing dispatchers to focus on cases judged to be emotionally unstable. Further research and validation are required to improve the model's performance and make it suitable for the general population.
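The evaluation scheme named in the abstract, repeated random sub-sampling cross-validation around an SVM on MFCC features, can be sketched as follows; scikit-learn's ShuffleSplit implements exactly that resampling, though the data here is synthetic rather than call-recording features.

```python
# Hedged sketch: SVM on per-call MFCC summaries, judged by repeated
# random sub-sampling cross-validation (RRS-CV).
import numpy as np
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = rng.normal(0, 1, (337, 13))   # stand-in for per-call MFCC summaries
y = rng.integers(0, 2, 337)       # stand-in emotional-state labels

rrs = ShuffleSplit(n_splits=10, test_size=0.2, random_state=2)  # RRS-CV
scores = cross_val_score(SVC(kernel="rbf"), X, y, cv=rrs)
print(f"mean accuracy over 10 random sub-samples: {scores.mean():.2%}")
```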
7
Abstract
Recognizing emotions in human speech has always been an exciting challenge for scientists. In our work, a feature vector is obtained by dividing each sentence into an emotionally loaded part and a part that carries only informational load, and this parameterization is applied effectively. The expressiveness of human speech is shaped by the emotion it conveys: prosodic characteristics such as pitch, timbre, loudness, and vocal tone differentiate utterances and allow speech to be categorized into several emotions. We supplement these with a new classification feature, namely the division of a sentence into an emotionally loaded part and a purely informational part; the speech sample therefore changes when subjected to different emotional environments. Since a speaker's emotional states can be identified on the Mel scale, MFCCs are one way to study the emotional aspects of a speaker's utterances. In this work, we implement a model that identifies several emotional states from MFCCs for two datasets, classifies emotions on the basis of MFCC features, and compares the results. The classification model is based on dataset minimization, achieved by taking the mean of the features, which improves the classification accuracy of different machine learning algorithms. In addition to the static analysis of the speaker's tonal portrait used in MFCC, we propose a new method for the dynamic analysis of a phrase, treated as a new linguistic-emotional entity pronounced by the same speaker. By ranking the Mel-scale features by importance, we can parameterize the vector coordinates to be processed by a parameterized KNN method. Speech recognition is a multi-level pattern-recognition task: acoustic signals are analyzed and structured into a hierarchy of structural elements, words, phrases, and sentences, and each level of this hierarchy can provide temporal constraints, such as possible word sequences or known pronunciation types, that reduce the number of recognition errors at lower levels. Analyzing voice and speech dynamics is appropriate for improving the quality of machine perception and generation of human speech and is within the capabilities of artificial intelligence. The resulting emotion recognition can be widely applied in e-learning platforms, vehicle on-board systems, medicine, and other domains.
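A minimal sketch of the dataset-minimization idea described above: reduce each utterance to the mean of its MFCC frames, then classify emotions with a distance-weighted KNN. The frame matrices are random stand-ins for real MFCCs.

```python
# Hedged sketch: mean-of-MFCC vectors + parameterized (distance-weighted) KNN.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(3)
# Pretend each utterance is a (frames x 13) MFCC matrix of varying length
utterances = [rng.normal(0, 1, (rng.integers(50, 200), 13)) for _ in range(200)]
labels = rng.integers(0, 4, 200)                      # four stand-in emotion classes

X = np.array([u.mean(axis=0) for u in utterances])    # one mean vector per utterance
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, random_state=3)
knn = KNeighborsClassifier(n_neighbors=5, weights="distance").fit(X_tr, y_tr)
print(f"KNN accuracy: {knn.score(X_te, y_te):.2%}")
```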
9
Pramod Reddy A, V V. Recognition of human emotion with spectral features using multi-layer perceptron. International Journal of Knowledge-Based and Intelligent Engineering Systems 2020. [DOI: 10.3233/kes-200044]
Abstract
For emotion recognition, the features extracted from prevalent speech samples of the Berlin emotional database are pitch, intensity, log energy, formants, and mel-frequency cepstral coefficients (MFCC) as base features, with power spectral density added as a function of frequency. Seven emotions are considered in this study: anger, neutral, happiness, boredom, disgust, fear, and sadness. Temporal and spectral features are used to build the automatic emotion recognition (AER) model. The extracted features are analyzed using a support vector machine (SVM) and a multilayer perceptron (MLP), a class of feed-forward ANN classifiers, to classify the different emotional states. We observed 91% accuracy for the anger and boredom classes using SVM and more than 96% using the ANN, with an overall accuracy of 87.17% for SVM and 94% for the ANN.
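One way to realize the abstract's "power spectral density as an added function of frequency" is to summarize a Welch PSD estimate into band energies appended to the base features; scipy's estimator is one reasonable choice, not necessarily the authors'.

```python
# Hedged sketch: PSD band energies as extra spectral features.
import numpy as np
from scipy.signal import welch

rng = np.random.default_rng(4)
signal = rng.normal(0, 1, 16000)                 # 1 s stand-in for a speech sample
freqs, psd = welch(signal, fs=16000, nperseg=512)

bands = [(0, 500), (500, 1000), (1000, 2000), (2000, 4000)]  # Hz
band_energy = [psd[(freqs >= lo) & (freqs < hi)].sum() for lo, hi in bands]
print([f"{e:.3e}" for e in band_energy])         # append to pitch/energy/MFCC features
```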
10
Yang N, Dey N, Sherratt RS, Shi F. Recognize basic emotional states in speech by machine learning techniques using mel-frequency cepstral coefficient features. Journal of Intelligent & Fuzzy Systems 2020. [DOI: 10.3233/jifs-179963]
Abstract
Speech Emotion Recognition (SER) has been widely used in many fields, such as the smart home assistants commonly found on the market. Smart home assistants that could detect the user's emotion would improve communication between the user and the assistant, enabling the assistant to offer more productive feedback. Thus, the aim of this work is to analyze emotional states in speech and propose a suitable algorithm, weighing performance against complexity, for deployment in smart home devices. Four emotional speech sets were selected from the Berlin Emotional Database (EMO-DB) as experimental data, and 26 MFCC features were extracted from each type of emotional speech to identify the emotions of happiness, anger, sadness, and neutrality. Speaker-independent SER experiments were then conducted using the Back Propagation Neural Network (BPNN), Extreme Learning Machine (ELM), Probabilistic Neural Network (PNN), and Support Vector Machine (SVM). Considering both recognition accuracy and processing time, the SVM performed best among the four methods, making it a good candidate for deployment in smart home devices. The SVM achieved an overall accuracy of 92.4% while imposing low computational requirements for training and testing. We conclude that the MFCC features and the SVM classification models used in the speaker-independent experiments are highly effective for the automatic prediction of emotion.
Affiliation(s)
- Ningning Yang
- First Affiliated Hospital of Wenzhou Medical University, Wenzhou, China
- Nilanjan Dey
- Department of Information Technology, Techno India College of Technology, West Bengal, India
- Fuqian Shi
- First Affiliated Hospital of Wenzhou Medical University, Wenzhou, China
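Speaker independence, the key property of the experiments above, means utterances from one speaker never appear in both the training and test sets. A sketch with grouped splitting, with 26-dimensional random vectors standing in for the paper's 26 MFCC features:

```python
# Hedged sketch: speaker-independent SVM evaluation via grouped splitting.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit
from sklearn.svm import SVC

rng = np.random.default_rng(5)
X = rng.normal(0, 1, (400, 26))      # stand-in for 26 MFCC features per utterance
y = rng.integers(0, 4, 400)          # happy / angry / sad / neutral stand-ins
speakers = rng.integers(0, 10, 400)  # speaker ID per utterance

gss = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=5)
train_idx, test_idx = next(gss.split(X, y, groups=speakers))   # no speaker overlap
clf = SVC(kernel="rbf").fit(X[train_idx], y[train_idx])
print(f"speaker-independent accuracy: {clf.score(X[test_idx], y[test_idx]):.2%}")
```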
11
Reduction of the Multipath Propagation Effect in a Hydroacoustic Channel Using Filtration in Cepstrum. Sensors (Basel) 2020; 20:751. [PMID: 32013243] [PMCID: PMC7038370] [DOI: 10.3390/s20030751]
Abstract
During data transmission in a hydroacoustic channel, one of the problems is the multipath propagation effect, which degrades transmission parameters and sometimes prevents transmission entirely. We have therefore attempted to develop a method, based on the recorded hydroacoustic signal, that allows the original (generated) signal to be recreated by eliminating the multipath effect. In our method, we use cepstral analysis to eliminate replicas of the generated signal. The method was tested in simulation and in measurements in a real environment. Additionally, the influence of the method on data transmission in the hydroacoustic channel was tested. The results confirmed the usefulness of the developed method, which improved the quality of data transmission by reducing the multipath propagation effect.
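The core idea, that an echo shows up as a peak at its delay lag in the cepstrum, which can then be notched out (liftered), is easy to demonstrate on a toy signal. This sketch works on the real cepstrum of the magnitude spectrum only, so it is a simplification: full time-domain reconstruction would need the complex cepstrum and phase handling, which the paper's method addresses and this toy does not.

```python
# Hedged sketch: detect and notch a multipath replica in the cepstrum.
import numpy as np

rng = np.random.default_rng(6)
fs = 8000
clean = rng.normal(0, 1, fs)                  # broadband stand-in signal
delay = 400                                   # echo delay in samples
received = clean.copy()
received[delay:] += 0.5 * clean[:-delay]      # one multipath replica

log_mag = np.log(np.abs(np.fft.fft(received)) + 1e-12)
cepstrum = np.fft.ifft(log_mag).real

lag = int(np.argmax(cepstrum[100:fs // 2])) + 100   # echo peak in quefrency
print(f"detected echo lag: {lag} samples (true: {delay})")

lifted = cepstrum.copy()
for k in (lag, fs - lag):                     # notch the peak and its mirror
    lifted[k - 2:k + 3] = 0.0
smoothed_log_mag = np.fft.fft(lifted).real    # log spectrum, dominant echo ripple removed
```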
12
Classifying Heart Sounds Using Images of Motifs, MFCC and Temporal Features. J Med Syst 2019; 43:168. [DOI: 10.1007/s10916-019-1286-5]
13
Zerari N, Abdelhamid S, Bouzgou H, Raymond C. Bidirectional deep architecture for Arabic speech recognition. Open Computer Science 2019. [DOI: 10.1515/comp-2019-0004]
Abstract
Nowadays, real-life constraints necessitate controlling modern machines through human intervention by means of the sensory organs. The voice is one of the human faculties that can control and monitor modern interfaces. In this context, automatic speech recognition is principally used to convert natural voice into computer text and to perform actions based on instructions given by a human. In this paper, we propose a general framework for Arabic speech recognition that uses a Long Short-Term Memory (LSTM) network and a neural network (multi-layer perceptron, MLP) classifier to cope with the non-uniform sequence lengths of speech utterances produced by two feature extraction techniques: (1) Mel-frequency cepstral coefficients (MFCC, static and dynamic features) and (2) filter bank (FB) coefficients. The neural architecture recognizes isolated Arabic speech via classification. The proposed system first extracts pertinent features from the natural speech signal using MFCC (static and dynamic features) and FB coefficients. Next, the extracted features are padded to deal with the non-uniform sequence lengths. Then, a deep architecture represented by a recurrent LSTM or GRU (gated recurrent unit) encodes the sequence of MFCC/FB features as a fixed-size vector, which is passed to a multi-layer perceptron network (MLP) to perform the classification (recognition). The proposed system is assessed using two different databases: the first concerns spoken digit recognition, where a comparison with related work in the literature is performed, whereas the second contains spoken TV commands. The obtained results show the superiority of the proposed approach.
Affiliation(s)
- Naima Zerari
- Laboratory of Automation and Manufacturing, Department of Industrial Engineering, University of Batna 2 Mostefa Ben Boulaid, Batna, 05000, Algeria
- Samir Abdelhamid
- Laboratory of Automation and Manufacturing, Department of Industrial Engineering, University of Batna 2 Mostefa Ben Boulaid, Batna, 05000, Algeria
- Hassen Bouzgou
- Department of Industrial Engineering, University of Batna 2 Mostefa Ben Boulaid, Batna, 05000, Algeria
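The abstract outlines the pipeline: pad variable-length MFCC sequences, encode each with a bidirectional LSTM into a fixed-size vector, then classify with an MLP. A sketch, with illustrative dimensions rather than the paper's exact setup:

```python
# Hedged PyTorch sketch: padded MFCC sequences -> bidirectional LSTM
# encoder -> fixed-size vector -> MLP classifier for isolated words.
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence

class LSTMWordClassifier(nn.Module):
    def __init__(self, n_feats=13, hidden=64, n_words=10):
        super().__init__()
        self.encoder = nn.LSTM(n_feats, hidden, batch_first=True, bidirectional=True)
        self.mlp = nn.Sequential(nn.Linear(2 * hidden, 64), nn.ReLU(), nn.Linear(64, n_words))

    def forward(self, x):
        _, (h, _) = self.encoder(x)               # h: (2, batch, hidden)
        fixed = torch.cat([h[0], h[1]], dim=1)    # fixed-size utterance vector
        return self.mlp(fixed)

# Variable-length fake MFCC sequences, padded to a common length
seqs = [torch.randn(torch.randint(40, 120, (1,)).item(), 13) for _ in range(4)]
batch = pad_sequence(seqs, batch_first=True)      # (4, max_len, 13)
print(LSTMWordClassifier()(batch).shape)          # torch.Size([4, 10])
```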
14
Nogueira DM, Ferreira CA, Jorge AM. Classifying Heart Sounds Using Images of MFCC and Temporal Features. Progress in Artificial Intelligence 2017. [DOI: 10.1007/978-3-319-65340-2_16]