1
Bian Y, Küster D, Liu H, Krumhuber EG. Understanding Naturalistic Facial Expressions with Deep Learning and Multimodal Large Language Models. Sensors (Basel) 2023; 24:126. [PMID: 38202988] [PMCID: PMC10781259] [DOI: 10.3390/s24010126]
Abstract
This paper provides a comprehensive overview of affective computing systems for facial expression recognition (FER) research in naturalistic contexts. The first section presents an updated account of user-friendly FER toolboxes incorporating state-of-the-art deep learning models and elaborates on their neural architectures, datasets, and performance across domains. These sophisticated FER toolboxes can robustly address a variety of challenges encountered in the wild, such as variations in illumination and head pose, that would otherwise impair recognition accuracy. The second section discusses multimodal large language models (MLLMs) and their potential applications in affective science. MLLMs exhibit human-level capabilities for FER and enable the quantification of various contextual variables to provide context-aware emotion inferences. These advancements have the potential to revolutionize current methodological approaches for studying the contextual influences on emotions, leading to the development of contextualized emotion models.
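The context-aware inference described here can be illustrated with a short sketch. This is a minimal example assuming a generic vision-language endpoint: `call_mllm` is a hypothetical placeholder, not a real client library, and the prompt wording is illustrative.

```python
# Sketch: context-aware emotion inference with a multimodal LLM.
# `call_mllm` is a hypothetical stand-in for any vision-language API;
# it is NOT a real library call and must be wired to a real provider.
import base64

def call_mllm(prompt: str, image_b64: str) -> str:
    """Hypothetical MLLM endpoint; replace with a real client."""
    raise NotImplementedError("wire up your MLLM provider here")

def infer_emotion(image_path: str, context: str) -> str:
    # Encode the face image so it can travel alongside the text prompt.
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("ascii")
    prompt = (
        "Given the situational context below, describe the most likely "
        "emotion of the person in the image and justify briefly.\n"
        f"Context: {context}"
    )
    return call_mllm(prompt, image_b64)
```

The key design point is that situational context is passed alongside the image, so the model's emotion inference is conditioned on both the face and its circumstances.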
Affiliation(s)
- Yifan Bian
- Department of Experimental Psychology, University College London, London WC1H 0AP, UK
- Dennis Küster
- Department of Mathematics and Computer Science, University of Bremen, 28359 Bremen, Germany
- Hui Liu
- Department of Mathematics and Computer Science, University of Bremen, 28359 Bremen, Germany
- Eva G. Krumhuber
- Department of Experimental Psychology, University College London, London WC1H 0AP, UK
2
Namba S, Sato W, Namba S, Nomiya H, Shimokawa K, Osumi M. Development of the RIKEN database for dynamic facial expressions with multiple angles. Sci Rep 2023; 13:21785. [PMID: 38066065] [PMCID: PMC10709572] [DOI: 10.1038/s41598-023-49209-8]
Abstract
Research on facial expressions with sensing information is progressing in multidisciplinary fields such as psychology, affective computing, and cognitive science. Previous facial datasets have not simultaneously dealt with multiple theoretical views of emotion, individualized context, or multi-angle/depth information. We developed a new facial database (the RIKEN facial expression database) that includes multiple theoretical views of emotions and expressers' individualized events with multi-angle and depth information. The database contains recordings of 48 Japanese participants captured by ten Kinect cameras across 25 events. This study identified several valence-related facial patterns and found them consistent with previous research investigating the coherence between facial movements and internal states. The database represents an advancement toward developing new sensing systems, conducting psychological experiments, and understanding the complexity of emotional events.
Affiliation(s)
- Shushi Namba
- RIKEN, Psychological Process Research Team, Guardian Robot Project, Kyoto 6190288, Japan
- Department of Psychology, Hiroshima University, Hiroshima 7398524, Japan
- Wataru Sato
- RIKEN, Psychological Process Research Team, Guardian Robot Project, Kyoto 6190288, Japan
- Saori Namba
- Department of Psychology, Hiroshima University, Hiroshima 7398524, Japan
- Hiroki Nomiya
- Faculty of Information and Human Sciences, Kyoto Institute of Technology, Kyoto 6068585, Japan
- Koh Shimokawa
- KOHINATA Limited Liability Company, Osaka 5560020, Japan
- Masaki Osumi
- KOHINATA Limited Liability Company, Osaka 5560020, Japan
3
Namba S, Sato W, Matsui H. Spatio-Temporal Properties of Amused, Embarrassed, and Pained Smiles. J Nonverbal Behav 2022. [DOI: 10.1007/s10919-022-00404-7]
Abstract
Smiles are universal but nuanced facial expressions that are most frequently used in face-to-face communication, typically indicating amusement but sometimes conveying negative emotions such as embarrassment and pain. Although previous studies have suggested that spatial and temporal properties could differ among these various types of smiles, no study has thoroughly analyzed these properties. This study aimed to clarify the spatiotemporal properties of smiles conveying amusement, embarrassment, and pain using a spontaneous facial behavior database. The results regarding spatial patterns revealed that pained smiles showed less eye constriction and more overall facial tension than amused smiles; no spatial differences were identified between embarrassed and amused smiles. Regarding temporal properties, embarrassed and pained smiles remained in a state of higher facial tension than amused smiles. Moreover, embarrassed smiles showed a more gradual change from tension states to the smile state than amused smiles, and pained smiles had lower probabilities of staying in or transitioning to the smile state compared to amused smiles. By comparing the spatiotemporal properties of these three smile types, this study revealed that the probability of transitioning between discrete states could help distinguish amused, embarrassed, and pained smiles.
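The transition probabilities highlighted here can be estimated from frame-wise state labels with a first-order Markov model. Below is a minimal sketch; the state names and the toy frame sequence are illustrative, not the study's data.

```python
# Sketch: estimating transition probabilities between discrete facial
# states (e.g., "neutral", "tension", "smile") from frame-wise labels.
import numpy as np

def transition_matrix(seq, states):
    idx = {s: i for i, s in enumerate(states)}
    counts = np.zeros((len(states), len(states)))
    # Count consecutive-frame transitions.
    for a, b in zip(seq[:-1], seq[1:]):
        counts[idx[a], idx[b]] += 1
    # Normalize each row to get P(next state | current state).
    row_sums = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, row_sums, out=np.zeros_like(counts),
                     where=row_sums > 0)

states = ["neutral", "tension", "smile"]
frames = ["neutral", "tension", "tension", "smile", "smile", "tension"]
print(transition_matrix(frames, states))
```

Comparing such matrices across smile types is what lets a low probability of entering or staying in the smile state flag a pained smile.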
4
Guerdelli H, Ferrari C, Barhoumi W, Ghazouani H, Berretti S. Macro- and Micro-Expressions Facial Datasets: A Survey. Sensors (Basel) 2022; 22:1524. [PMID: 35214430] [PMCID: PMC8879817] [DOI: 10.3390/s22041524]
Abstract
Automatic facial expression recognition is essential for many potential applications. A clear overview of the datasets that have been investigated for facial expression recognition is therefore of paramount importance for designing and evaluating effective solutions, notably for neural network training. In this survey, we review more than eighty facial expression datasets, covering both macro- and micro-expressions. The study focuses mostly on spontaneous and in-the-wild datasets, reflecting the research trend toward contexts where expressions are shown spontaneously and in real settings. We also provide instances of potential applications of the investigated datasets, highlighting their pros and cons. This survey can help researchers better understand the characteristics of existing datasets, thus facilitating the choice of data that best suit the context of their application.
Affiliation(s)
- Hajer Guerdelli
- Research Team on Intelligent Systems in Imaging and Artificial Vision (SIIVA), LR16ES06 Laboratoire de Recherche en Informatique, Modélisation et Traitement de l'Information et de la Connaissance (LIMTIC), Institut Supérieur d'Informatique d'El Manar, Université de Tunis El Manar, Tunis 1068, Tunisia
- Media Integration and Communication Center, University of Florence, 50121 Firenze, Italy
- Claudio Ferrari
- Department of Engineering and Architecture, University of Parma, 43121 Parma, Italy
- Walid Barhoumi
- Research Team on Intelligent Systems in Imaging and Artificial Vision (SIIVA), LR16ES06 Laboratoire de Recherche en Informatique, Modélisation et Traitement de l'Information et de la Connaissance (LIMTIC), Institut Supérieur d'Informatique d'El Manar, Université de Tunis El Manar, Tunis 1068, Tunisia
- Ecole Nationale d'Ingénieurs de Carthage, Université de Carthage, Carthage 1054, Tunisia
- Haythem Ghazouani
- Research Team on Intelligent Systems in Imaging and Artificial Vision (SIIVA), LR16ES06 Laboratoire de Recherche en Informatique, Modélisation et Traitement de l'Information et de la Connaissance (LIMTIC), Institut Supérieur d'Informatique d'El Manar, Université de Tunis El Manar, Tunis 1068, Tunisia
- Ecole Nationale d'Ingénieurs de Carthage, Université de Carthage, Carthage 1054, Tunisia
- Stefano Berretti
- Media Integration and Communication Center, University of Florence, 50121 Firenze, Italy
- Correspondence: Tel.: +39-216-96202969
5
Li Y, Zeng J, Shan S. Learning Representations for Facial Actions From Unlabeled Videos. IEEE Trans Pattern Anal Mach Intell 2022; 44:302-317. [PMID: 32750828] [DOI: 10.1109/tpami.2020.3011063]
Abstract
Facial actions are usually encoded as anatomy-based action units (AUs), the labelling of which demands expertise and thus is time-consuming and expensive. To alleviate the labelling demand, we propose a twin-cycle autoencoder (TAE) that leverages large numbers of unlabelled videos to learn discriminative representations for facial actions. TAE is inspired by the fact that facial actions are embedded in the pixel-wise displacements between two sequential face images (hereinafter, source and target) in a video; learning representations of facial actions can therefore be achieved by learning representations of the displacements. However, the displacements induced by facial actions are entangled with those induced by head motions. TAE is thus trained to disentangle the two kinds of movements by evaluating the quality of the images synthesized when either the facial actions or the head pose is changed, aiming to reconstruct the target image. Experiments on AU detection show that TAE can achieve accuracy comparable to existing AU detection methods, including some supervised ones, validating the discriminative capacity of the representations learned by TAE. TAE's ability to decouple action-induced and pose-induced movements is also validated by visualizing the generated images and analyzing the facial image retrieval results qualitatively and quantitatively.
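The disentangling objective lends itself to a compact sketch. Below is a minimal PyTorch sketch of the general idea, assuming a toy two-field encoder and standard backward warping; the layer sizes, the single composed warp, and the L1 loss are illustrative assumptions, not the paper's exact architecture or training losses.

```python
# Minimal sketch of the twin-cycle idea: split the source-to-target motion
# into two displacement fields (facial action vs. head pose) and train by
# reconstructing the target from the warped source.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DisplacementEncoder(nn.Module):
    """Predicts two displacement fields from a (source, target) pair."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(6, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 4, 3, padding=1),  # 2 flow channels per field
        )

    def forward(self, src, tgt):
        fields = self.net(torch.cat([src, tgt], dim=1))
        return fields[:, :2], fields[:, 2:]  # action flow, pose flow

def warp(img, flow):
    """Backward-warp img by a dense flow in normalized coordinates."""
    n, _, h, w = img.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h),
                            torch.linspace(-1, 1, w), indexing="ij")
    base = torch.stack([xs, ys], dim=-1).unsqueeze(0).expand(n, -1, -1, -1)
    grid = base + flow.permute(0, 2, 3, 1)
    return F.grid_sample(img, grid, align_corners=True)

enc = DisplacementEncoder()
src, tgt = torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)
action_flow, pose_flow = enc(src, tgt)
recon = warp(warp(src, pose_flow), action_flow)  # compose both motions
loss = F.l1_loss(recon, tgt)  # reconstruction objective (one of several)
loss.backward()
```

The action flow, once separated from the pose flow, serves as the facial-action representation that is later evaluated on AU detection.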
6
Namba S, Sato W, Osumi M, Shimokawa K. Assessing Automated Facial Action Unit Detection Systems for Analyzing Cross-Domain Facial Expression Databases. Sensors (Basel) 2021; 21:4222. [PMID: 34203007] [PMCID: PMC8235167] [DOI: 10.3390/s21124222]
Abstract
In the field of affective computing, achieving accurate automatic detection of facial movements is an important issue, and great progress has already been made. However, a systematic evaluation of these systems on dynamic facial databases remains an unmet need. This study compared the performance of three systems (FaceReader, OpenFace, and AFARtoolbox) that detect facial movements corresponding to action units (AUs) derived from the Facial Action Coding System. All three systems detected the presence of AUs in the dynamic facial database at above-chance levels. Moreover, OpenFace and AFAR yielded higher area under the receiver operating characteristic curve values than FaceReader. In addition, several confusion biases between facial components (e.g., AU12 and AU14) were observed for each automated AU detection system, and the static mode was superior to the dynamic mode for analyzing the posed facial database. These findings characterize the prediction patterns of each system and provide guidance for research on facial expressions.
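As a concrete illustration of the evaluation metric, here is a minimal sketch of per-AU AUC scoring with scikit-learn; the frame-level labels and detector scores are toy stand-ins, not data from the study.

```python
# Sketch: scoring an AU detector against manual FACS codes with the area
# under the ROC curve, per AU.
from sklearn.metrics import roc_auc_score

ground_truth = {  # 1 = AU present in the frame, 0 = absent (toy data)
    "AU12": [1, 1, 0, 0, 1, 0],
    "AU14": [0, 1, 0, 1, 0, 0],
}
detector_scores = {  # continuous detector outputs per frame (toy data)
    "AU12": [0.9, 0.7, 0.2, 0.4, 0.8, 0.1],
    "AU14": [0.3, 0.6, 0.2, 0.7, 0.4, 0.1],
}

for au in ground_truth:
    auc = roc_auc_score(ground_truth[au], detector_scores[au])
    print(f"{au}: AUC = {auc:.2f}")
```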
Affiliation(s)
- Shushi Namba
- Psychological Process Team, BZP, Robotics Project, RIKEN, 2-2-2 Hikaridai, Seika-cho, Soraku-gun, Kyoto 6190288, Japan
- Wataru Sato
- Psychological Process Team, BZP, Robotics Project, RIKEN, 2-2-2 Hikaridai, Seika-cho, Soraku-gun, Kyoto 6190288, Japan
- Masaki Osumi
- KOHINATA Limited Liability Company, 2-7-3, Tateba, Naniwa-ku, Osaka 5560020, Japan
- Koh Shimokawa
- KOHINATA Limited Liability Company, 2-7-3, Tateba, Naniwa-ku, Osaka 5560020, Japan
7
Lin V, Girard JM, Sayette MA, Morency LP. Toward Multimodal Modeling of Emotional Expressiveness. Proc ACM Int Conf Multimodal Interact (ICMI) 2020; 2020:548-557. [PMID: 33969360] [PMCID: PMC8106384] [DOI: 10.1145/3382507.3418887]
Abstract
Emotional expressiveness captures the extent to which a person tends to outwardly display their emotions through behavior. Due to the close relationship between emotional expressiveness and behavioral health, as well as the crucial role it plays in social interaction, the ability to automatically predict emotional expressiveness stands to spur advances in science, medicine, and industry. In this paper, we explore three related research questions. First, how well can emotional expressiveness be predicted from visual, linguistic, and multimodal behavioral signals? Second, how important is each behavioral modality to the prediction of emotional expressiveness? Third, which behavioral signals are reliably related to emotional expressiveness? To answer these questions, we add highly reliable transcripts and human ratings of perceived emotional expressiveness to an existing video database and use these data to train, validate, and test predictive models. Our best model shows promising predictive performance on this dataset (RMSE = 0.65, R² = 0.45, r = 0.74). Multimodal models tend to perform best overall, and models trained on the linguistic modality tend to outperform models trained on the visual modality. Finally, examination of our interpretable models' coefficients reveals a number of visual and linguistic behavioral signals, such as facial action unit intensity, overall word count, and use of words related to social processes, that reliably predict emotional expressiveness.
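For reference, a minimal sketch of how the three reported metrics are computed; the rating values below are toy stand-ins, not the study's data.

```python
# Sketch: RMSE, R², and Pearson's r for predicted vs. observed
# expressiveness ratings.
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([1.2, 2.5, 3.1, 4.0, 2.2])  # observed ratings (toy data)
y_pred = np.array([1.0, 2.9, 2.8, 3.7, 2.6])  # model predictions (toy data)

rmse = np.sqrt(mean_squared_error(y_true, y_pred))
r2 = r2_score(y_true, y_pred)
r, _ = pearsonr(y_true, y_pred)
print(f"RMSE = {rmse:.2f}, R² = {r2:.2f}, r = {r:.2f}")
```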
8
9
Ertugrul IO, Cohn JF, Jeni LA, Zhang Z, Yin L, Ji Q. Crossing Domains for AU Coding: Perspectives, Approaches, and Measures. IEEE Trans Biom Behav Identity Sci 2020; 2:158-171. [PMID: 32377637] [PMCID: PMC7202467] [DOI: 10.1109/tbiom.2020.2977225]
Abstract
Facial action unit (AU) detectors have performed well when trained and tested within the same domain. How well do AU detectors transfer to domains in which they have not been trained? We review the literature on cross-domain transfer and conduct experiments to address limitations of prior research. We evaluate generalizability in four publicly available databases: EB+ (an expanded version of BP4D+), Sayette GFT, DISFA, and UNBC Shoulder Pain (SP). The databases differ in observational scenarios, context, participant diversity, range of head pose, video resolution, and AU base rates. In most cases performance decreased with change in domain, often to below the threshold needed for behavioral research. However, exceptions were noted. Deep and shallow approaches generally performed similarly, with average results slightly better for the deep model than for the shallow one. Occlusion sensitivity maps revealed that local specificity was greater for AU detection within domains than across them. The findings suggest that more varied domains and deep learning approaches may be better suited to generalizability, and they point to the need for more attention to characteristics that vary between domains. Until further improvement is realized, caution is warranted when applying AU classifiers from one domain to another.
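A minimal sketch of the occlusion-sensitivity analysis mentioned above: a gray patch is slid over the input and the drop in the detector's output is recorded at each location. The `au_detector` stub is a hypothetical placeholder for a real model, and the patch and stride values are arbitrary.

```python
# Sketch: an occlusion sensitivity map for probing where an AU detector
# looks. Higher values mean occluding that region hurts the score more.
import numpy as np

def au_detector(img: np.ndarray) -> float:
    """Hypothetical AU detector; replace with a real model's forward pass."""
    return float(img.mean())  # placeholder so the sketch runs end to end

def occlusion_map(img, patch=16, stride=8, fill=0.5):
    base = au_detector(img)
    h, w = img.shape[:2]
    heat = np.zeros(((h - patch) // stride + 1, (w - patch) // stride + 1))
    for i, y in enumerate(range(0, h - patch + 1, stride)):
        for j, x in enumerate(range(0, w - patch + 1, stride)):
            occluded = img.copy()
            occluded[y:y + patch, x:x + patch] = fill  # mask one region
            heat[i, j] = base - au_detector(occluded)  # score drop
    return heat

img = np.random.rand(64, 64)
print(occlusion_map(img).shape)  # (7, 7) grid of sensitivity values
```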
Affiliation(s)
- Jeffrey F Cohn
- Department of Psychology, University of Pittsburgh, Pittsburgh, PA, USA
- László A Jeni
- Robotics Institute, Carnegie Mellon University, Pittsburgh, PA, USA
- Zheng Zhang
- Department of Computer Science, State University of New York at Binghamton, USA
- Lijun Yin
- Department of Computer Science, State University of New York at Binghamton, USA
- Qiang Ji
- Rensselaer Polytechnic Institute, Troy, NY, USA
10
Ertugrul IO, Cohn JF, Jeni LA, Zhang Z, Yin L, Ji Q. Cross-domain AU Detection: Domains, Learning Approaches, and Measures. Proc IEEE Int Conf Autom Face Gesture Recognit (FG) 2019. [PMID: 31749665] [DOI: 10.1109/fg.2019.8756543]
Abstract
Facial action unit (AU) detectors have performed well when trained and tested within the same domain. Do AU detectors transfer to new domains in which they have not been trained? To answer this question, we review literature on cross-domain transfer and conduct experiments to address limitations of prior research. We evaluate both deep and shallow approaches to AU detection (CNN and SVM, respectively) in two large, well-annotated, publicly available databases, Expanded BP4D+ and GFT. The databases differ in observational scenarios, participant characteristics, range of head pose, video resolution, and AU base rates. For both approaches and databases, performance decreased with change in domain, often to below the threshold needed for behavioral research. Decreases were not uniform, however. They were more pronounced for GFT than for Expanded BP4D+ and for shallow relative to deep learning. These findings suggest that more varied domains and deep learning approaches may be better suited for promoting generalizability. Until further improvement is realized, caution is warranted when applying AU classifiers from one domain to another.
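The cross-domain protocol reduces to a train-on-A, test-on-B loop. Here is a minimal sketch with a linear SVM as the shallow detector; the features and labels are random stand-ins for the real per-frame AU features.

```python
# Sketch: within-domain vs. cross-domain evaluation of a shallow AU
# detector (linear SVM), illustrating the protocol described above.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
X_a, y_a = rng.random((500, 20)), rng.integers(0, 2, 500)  # domain A (toy)
X_b, y_b = rng.random((300, 20)), rng.integers(0, 2, 300)  # domain B (toy)

clf = LinearSVC().fit(X_a, y_a)  # train on domain A only
print("within-domain F1:", f1_score(y_a, clf.predict(X_a)))
print("cross-domain F1:", f1_score(y_b, clf.predict(X_b)))
```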
Affiliation(s)
- Jeffrey F Cohn
- Department of Psychology, University of Pittsburgh, Pittsburgh, PA, USA
- László A Jeni
- Robotics Institute, Carnegie Mellon University, Pittsburgh, PA, USA
- Zheng Zhang
- Department of Computer Science, State University of New York at Binghamton, USA
- Lijun Yin
- Department of Computer Science, State University of New York at Binghamton, USA
- Qiang Ji
- Rensselaer Polytechnic Institute, Troy, NY, USA
11
Predicting Group Contribution Behaviour in a Public Goods Game from Face-to-Face Communication. Sensors (Basel) 2019; 19:2786. [PMID: 31234293] [PMCID: PMC6632011] [DOI: 10.3390/s19122786]
Abstract
Experimental economics laboratories run many studies to test theoretical predictions against actual human behaviour, including public goods games. In this experiment, participants in a group have the option to invest money in a public account or to keep it; all invested money is multiplied and then evenly distributed. This structure incentivizes free riding, so contributions to the public good decline over time. Face-to-face communication (FFC) diminishes free riding and thus positively affects contribution behaviour, but how it does so has remained largely unknown. In this paper, we investigate two communication channels, aiming to explain what promotes cooperation and discourages free riding. Firstly, the facial expressions of the group in the 3-minute FFC videos are automatically analysed to predict the group's behaviour towards the end of the game. The proposed automatic facial expression analysis approach uses a new group activity descriptor and utilises random forest classification. Secondly, the contents of FFC are investigated by categorising strategy-relevant topics and using meta-data. The results show that it is possible to predict whether a group will fully contribute to the end of the game based on facial expression data from three minutes of FFC, although deeper understanding requires a larger dataset. Facial expression analysis and content analysis found that FFC and talking until the very end had a significant, positive effect on contributions.
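A minimal sketch of the classification step, assuming one fixed-length activity descriptor per group; the descriptors, labels, and forest size are illustrative, not the study's.

```python
# Sketch: predicting whether a group fully contributes at game end from a
# per-group activity descriptor, using a random forest as described above.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
descriptors = rng.random((40, 10))          # one descriptor per FFC video (toy)
fully_contributes = rng.integers(0, 2, 40)  # 1 = full contribution (toy)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
scores = cross_val_score(clf, descriptors, fully_contributes, cv=5)
print("cross-validated accuracy:", scores.mean())
```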
12
Chu WS, De la Torre F, Cohn JF. Learning Facial Action Units with Spatiotemporal Cues and Multi-label Sampling. Image Vis Comput 2019; 81:1-14. [PMID: 30524157] [PMCID: PMC6277040] [DOI: 10.1016/j.imavis.2018.10.002]
Abstract
Facial action units (AUs) may be represented spatially, temporally, and in terms of their correlation. Previous research focuses on one or another of these aspects or addresses them disjointly. We propose a hybrid network architecture that jointly models spatial and temporal representations and their correlation. In particular, we use a Convolutional Neural Network (CNN) to learn spatial representations and a Long Short-Term Memory (LSTM) network to model temporal dependencies among them. The outputs of the CNN and LSTM are aggregated into a fusion network to produce per-frame predictions of multiple AUs. The hybrid network was compared to previous state-of-the-art approaches in two large FACS-coded video databases, GFT and BP4D, with over 400,000 AU-coded frames of spontaneous facial behavior in varied social contexts. Relative to a standard multi-label CNN and feature-based state-of-the-art approaches, the hybrid system reduced person-specific biases and increased accuracy for AU detection. To address class imbalance within and between batches during training, we introduce multi-label sampling strategies that further increase accuracy when AUs are relatively sparse. Finally, we provide visualizations of the learned AU models, which, to the best of our knowledge, reveal for the first time how machines see AUs.
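A minimal PyTorch sketch of the CNN-LSTM-fusion pattern described above; the layer sizes, clip length, and loss choice are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch of the hybrid idea: a CNN extracts per-frame spatial features, an
# LSTM models their temporal dependencies, and a fusion layer emits
# per-frame multi-AU logits.
import torch
import torch.nn as nn

class HybridAUNet(nn.Module):
    def __init__(self, n_aus=12, feat_dim=64, hidden=32):
        super().__init__()
        self.cnn = nn.Sequential(  # per-frame spatial encoder
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim),
        )
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.fusion = nn.Linear(feat_dim + hidden, n_aus)  # CNN+LSTM fusion

    def forward(self, clip):                        # clip: (B, T, 3, H, W)
        b, t = clip.shape[:2]
        feats = self.cnn(clip.flatten(0, 1)).view(b, t, -1)
        temporal, _ = self.lstm(feats)
        return self.fusion(torch.cat([feats, temporal], dim=-1))

model = HybridAUNet()
logits = model(torch.rand(2, 8, 3, 64, 64))         # 2 clips of 8 frames
labels = torch.randint(0, 2, logits.shape).float()  # toy multi-AU labels
loss = nn.BCEWithLogitsLoss()(logits, labels)       # per-frame multi-label loss
```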
Affiliation(s)
- Wen-Sheng Chu
- Robotics Institute, Carnegie Mellon University, Pittsburgh, USA
- Jeffrey F Cohn
- Department of Psychology, University of Pittsburgh, Pittsburgh, USA
13
Hammal Z, Chu WS, Cohn JF, Heike C, Speltz ML. Automatic Action Unit Detection in Infants Using Convolutional Neural Network. Proc Int Conf Affect Comput Intell Interact (ACII) 2017; 2017:216-221. [PMID: 29862131] [PMCID: PMC5976252] [DOI: 10.1109/acii.2017.8273603]
Abstract
Action unit detection in infants relative to adults presents unique challenges. Jaw contour is less distinct, facial texture is reduced, and rapid and unusual facial movements are common. To detect facial action units in spontaneous behavior of infants, we propose a multi-label Convolutional Neural Network (CNN). Eighty-six infants were recorded during tasks intended to elicit enjoyment and frustration. Using an extension of FACS for infants (Baby FACS), over 230,000 frames were manually coded for ground truth. To control for chance agreement, inter-observer agreement between Baby-FACS coders was quantified using free-margin kappa. Kappa coefficients ranged from 0.79 to 0.93, which represents high agreement. The multi-label CNN achieved comparable agreement with manual coding. Kappa ranged from 0.69 to 0.93. Importantly, the CNN-based AU detection revealed the same change in findings with respect to infant expressiveness between tasks. While further research is needed, these findings suggest that automatic AU detection in infants is a viable alternative to manual coding of infant facial expression.
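Free-marginal kappa, the chance-corrected agreement statistic used here, is easy to compute: for k categories, chance agreement is fixed at 1/k. A minimal sketch with toy codes from two hypothetical coders:

```python
# Sketch: free-marginal kappa (Brennan-Prediger). For two coders and k
# categories, kappa = (Po - 1/k) / (1 - 1/k), where Po is observed
# percent agreement.
def free_marginal_kappa(coder_a, coder_b, n_categories=2):
    assert len(coder_a) == len(coder_b)
    po = sum(a == b for a, b in zip(coder_a, coder_b)) / len(coder_a)
    pe = 1 / n_categories  # chance agreement with free marginals
    return (po - pe) / (1 - pe)

# Toy frame-wise AU presence codes from two hypothetical coders.
a = [1, 0, 1, 1, 0, 1, 0, 0]
b = [1, 0, 1, 0, 0, 1, 0, 1]
print(free_marginal_kappa(a, b))  # 0.5 for 6/8 observed agreement
```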
Affiliation(s)
- Zakia Hammal
- Robotics Institute, Carnegie Mellon University, Pittsburgh, USA
- Wen-Sheng Chu
- Robotics Institute, Carnegie Mellon University, Pittsburgh, USA
- Jeffrey F Cohn
- Robotics Institute, Carnegie Mellon University, Pittsburgh, USA
- Department of Psychology, University of Pittsburgh, Pittsburgh, USA