1. Tan G, Wan Z, Wang Y, Cao Y, Zha ZJ. Tackling Event-Based Lip-Reading by Exploring Multigrained Spatiotemporal Clues. IEEE Transactions on Neural Networks and Learning Systems 2025; 36:8279-8291. [PMID: 39288038] [DOI: 10.1109/tnnls.2024.3440495]
Abstract
Automatic lip-reading (ALR) is the task of recognizing words based on visual information obtained from the speaker's lip movements. In this study, we introduce event cameras, a novel type of sensing device, for ALR. Event cameras offer both technical and application advantages over conventional cameras for ALR due to their higher temporal resolution, less redundant visual information, and lower power consumption. To recognize words from the event data, we propose a novel multigrained spatiotemporal feature learning framework, which is capable of perceiving fine-grained spatiotemporal features from microsecond time-resolved event data. Specifically, we first convert the event data into event frames of multiple temporal resolutions to avoid losing too much visual information at the event representation stage. These frames are then fed into a multibranch subnetwork in which the branch operating on low-rate frames perceives spatially complete but temporally coarse features, while the branch operating on high-rate frames perceives spatially coarse but temporally fine features. Fine-grained spatial and temporal features can thus be learned simultaneously by integrating the features perceived by the different branches. Furthermore, to model the temporal relationships in the event stream, we design a temporal aggregation subnetwork that aggregates the features perceived by the multibranch subnetwork. In addition, we collect two datasets (DVS-Lip and DVS-LRW100) for studying the event-based lip-reading task. Experimental results demonstrate the superiority of the proposed model over state-of-the-art event-based action recognition models and video-based lip-reading models.
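To make the representation step concrete, the following is a minimal sketch of binning a raw event stream into frames at a chosen temporal resolution. It is not the authors' implementation; the (t, x, y, p) event layout, the sensor size, and the frame counts are placeholder assumptions.

```python
import numpy as np

def events_to_frames(events, sensor_size, num_frames):
    """Accumulate a (t, x, y, p) event stream into num_frames frames.

    events: (N, 4) array with columns (timestamp, x, y, polarity),
    polarity in {-1, +1}; sensor_size: (height, width).
    Returns (num_frames, 2, H, W) per-polarity event counts.
    """
    h, w = sensor_size
    t = events[:, 0]
    # Assign each event to one of num_frames equal-duration temporal bins.
    span = np.ptp(t) + 1e-9
    bins = np.clip(((t - t.min()) / span * num_frames).astype(int),
                   0, num_frames - 1)
    frames = np.zeros((num_frames, 2, h, w), dtype=np.float32)
    x = events[:, 1].astype(int)
    y = events[:, 2].astype(int)
    pol = (events[:, 3] > 0).astype(int)  # channel 0 = OFF, 1 = ON events
    np.add.at(frames, (bins, pol, y, x), 1.0)  # scatter-add event counts
    return frames

# Multigrained input: the same stream rendered at a low and a high frame
# rate for the two branches (frame counts here are illustrative guesses,
# not the paper's settings).
# low_rate  = events_to_frames(ev, (128, 128), 8)   # spatially dense
# high_rate = events_to_frames(ev, (128, 128), 32)  # temporally fine
```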
2. Bissarinova U, Rakhimzhanova T, Kenzhebalin D, Varol HA. Faces in Event Streams (FES): An Annotated Face Dataset for Event Cameras. Sensors (Basel) 2024; 24:1409. [PMID: 38474947] [DOI: 10.3390/s24051409]
Abstract
The use of event-based cameras in computer vision is a growing research direction. However, despite existing research on face detection with event cameras, a substantial gap persists in the availability of a large dataset with annotated faces and facial landmarks on event streams, hampering the development of applications in this direction. In this work, we address this issue by publishing the first large and varied dataset (Faces in Event Streams), 689 min in duration, for face and facial landmark detection directly on event-based camera output. In addition, this article presents 12 models trained on our dataset to predict bounding-box and facial-landmark coordinates with an mAP50 score of more than 90%. We also demonstrated real-time detection with an event-based camera using our models.
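For context on the reported metric: mAP50 counts a predicted face box as correct when its intersection-over-union (IoU) with a ground-truth box is at least 0.5. A standard IoU helper, independent of this paper's code, looks like this:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

# A detection with iou(pred, gt) >= 0.5 is a true positive under mAP50;
# average precision is then the area under the precision-recall curve.
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # -> ~0.333
```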
Affiliation(s)
- Ulzhan Bissarinova
- Institute of Smart Systems and Artificial Intelligence, Nazarbayev University, Astana 010000, Kazakhstan
- Tomiris Rakhimzhanova
- Institute of Smart Systems and Artificial Intelligence, Nazarbayev University, Astana 010000, Kazakhstan
- Daulet Kenzhebalin
- Institute of Smart Systems and Artificial Intelligence, Nazarbayev University, Astana 010000, Kazakhstan
- Huseyin Atakan Varol
- Institute of Smart Systems and Artificial Intelligence, Nazarbayev University, Astana 010000, Kazakhstan
3. Kanamaru T, Arakane T, Saitoh T. Isolated single sound lip-reading using a frame-based camera and event-based camera. Frontiers in Artificial Intelligence 2023; 5:1070964. [PMID: 36714203] [PMCID: PMC9874941] [DOI: 10.3389/frai.2022.1070964]
Abstract
Unlike a conventional frame-based camera, an event-based camera detects changes in brightness at each pixel over time. This research explores lip-reading as a new application of the event-based camera. This paper proposes an event-camera-based lip-reading method for isolated single-sound recognition. The proposed method consists of generating images from event data, detecting the face and facial feature points, and recognizing the utterance with a temporal convolutional network (TCN). Furthermore, this paper proposes a method that combines the two modalities of the frame-based and event-based cameras. To evaluate the proposed method, utterance scenes of 15 Japanese consonants from 20 speakers were recorded with an event-based camera and a video camera, and an original dataset was constructed. Several experiments were conducted by generating images at multiple frame rates from the event data. The highest recognition accuracy was obtained with event-camera images at 60 fps. Moreover, combining the two modalities was confirmed to yield higher recognition accuracy than either single modality.
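As a hedged illustration of the recognition stage (not the authors' architecture; the feature dimension, dilations, and the 15-class head are assumptions keyed to the 15-consonant task), a minimal TCN classifier in PyTorch might look like:

```python
import torch
import torch.nn as nn

class TCNBlock(nn.Module):
    """One dilated 1-D convolution block with a residual connection."""
    def __init__(self, channels, dilation):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size=3,
                              padding=dilation, dilation=dilation)
        self.relu = nn.ReLU()

    def forward(self, x):              # x: (batch, channels, time)
        return self.relu(x + self.conv(x))

class LipReadingTCN(nn.Module):
    """Per-frame features -> stacked TCN blocks -> one logit per class."""
    def __init__(self, feat_dim=512, num_classes=15):
        super().__init__()
        self.tcn = nn.Sequential(TCNBlock(feat_dim, 1),
                                 TCNBlock(feat_dim, 2),
                                 TCNBlock(feat_dim, 4))
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, feats):          # feats: (batch, time, feat_dim)
        h = self.tcn(feats.transpose(1, 2))  # -> (batch, feat_dim, time)
        return self.head(h.mean(dim=2))      # pool over time, classify
```

Late fusion of the frame-camera and event-camera streams could then be as simple as averaging the two models' class logits, though the paper's actual fusion scheme is not specified in the abstract.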
4. Jiang R, Wang Q, Shi S, Mou X, Chen S. Flow-assisted visual tracking using event cameras. CAAI Transactions on Intelligence Technology 2021. [DOI: 10.1049/cit2.12005]
Affiliation(s)
- Rui Jiang
- CelePixel Technology Co. Ltd, 71 Nanyang Drive, Singapore 638075
- Qinyi Wang
- CelePixel Technology Co. Ltd, 71 Nanyang Drive, Singapore 638075
- School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore 639798
- Shunshun Shi
- CelePixel Technology Co. Ltd, 71 Nanyang Drive, Singapore 638075
- Xiaozheng Mou
- CelePixel Technology Co. Ltd, 71 Nanyang Drive, Singapore 638075
- Shoushun Chen
- CelePixel Technology Co. Ltd, 71 Nanyang Drive, Singapore 638075
- School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore 639798
5. Ronca V, Giorgi A, Rossi D, Di Florio A, Di Flumeri G, Aricò P, Sciaraffa N, Vozzi A, Tamborra L, Simonetti I, Borghini G. A Video-Based Technique for Heart Rate and Eye Blinks Rate Estimation: A Potential Solution for Telemonitoring and Remote Healthcare. Sensors (Basel) 2021; 21:1607. [PMID: 33668921] [PMCID: PMC7956514] [DOI: 10.3390/s21051607]
Abstract
Current telemedicine and remote healthcare applications foresee different interactions between the doctor and the patient, relying on commercial and medical wearable sensors and internet-based video conferencing platforms. Nevertheless, existing applications necessarily require contact between the patient and sensors for an objective evaluation of the patient's state. The proposed study explored an innovative video-based solution for monitoring the neurophysiological parameters of potential patients and assessing their mental state. In particular, we investigated the possibility of estimating the heart rate (HR) and eye blink rate (EBR) of participants performing laboratory tasks by means of facial video analysis. The objectives of the study were to (i) assess the effectiveness of the proposed technique in estimating HR and EBR by comparing the estimates with laboratory sensor-based measures and (ii) assess the capability of the video-based technique to discriminate between the participants' resting state (Nominal condition) and active state (Non-nominal condition). The results demonstrated that HR and EBR estimated through the facial-video technique and the laboratory equipment did not statistically differ (p > 0.1), and that these neurophysiological parameters discriminated between the Nominal and Non-nominal states (p < 0.02).
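The abstract does not detail the estimation pipeline, but a common video-based HR approach is green-channel remote photoplethysmography (rPPG): average the green channel over a face region per frame, band-pass to cardiac frequencies, and read off the dominant spectral peak. A generic sketch under those assumptions, not necessarily the authors' method:

```python
import numpy as np
from scipy.signal import butter, filtfilt

def estimate_hr_bpm(green_trace, fps):
    """Heart rate from the mean green-channel intensity of a face ROI.

    green_trace: 1-D array with one sample per frame, ideally spanning
    at least several seconds; fps: video frame rate in Hz.
    """
    x = green_trace - np.mean(green_trace)
    # Keep only plausible cardiac frequencies (0.7-4.0 Hz ~ 42-240 bpm).
    b, a = butter(3, [0.7, 4.0], btype="bandpass", fs=fps)
    x = filtfilt(b, a, x)
    # The dominant in-band spectral peak is taken as the pulse frequency.
    spectrum = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fps)
    band = (freqs >= 0.7) & (freqs <= 4.0)
    return 60.0 * freqs[band][np.argmax(spectrum[band])]
```

Eye blink rate is typically obtained by a separate mechanism, e.g. detecting dips in an eye-aspect-ratio signal computed from facial landmarks.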
Affiliation(s)
- Vincenzo Ronca
- Department of Anatomical, Histological, Forensic and Orthopaedic Sciences, Sapienza University, 00185 Rome, Italy
- BrainSigns srl, 00185 Rome, Italy
- Correspondence: V.R., G.B.; Tel.: +39-06-49910941
- Andrea Giorgi
- BrainSigns srl, 00185 Rome, Italy
- Dario Rossi
- Department of Business and Management, LUISS University, 00197 Rome, Italy
- Antonello Di Florio
- BrainSigns srl, 00185 Rome, Italy
- Gianluca Di Flumeri
- BrainSigns srl, 00185 Rome, Italy
- Department of Molecular Medicine, Sapienza University of Rome, 00185 Rome, Italy
- IRCCS Fondazione Santa Lucia, 00179 Rome, Italy
- Pietro Aricò
- BrainSigns srl, 00185 Rome, Italy
- Department of Molecular Medicine, Sapienza University of Rome, 00185 Rome, Italy
- IRCCS Fondazione Santa Lucia, 00179 Rome, Italy
- Nicolina Sciaraffa
- BrainSigns srl, 00185 Rome, Italy
- Department of Molecular Medicine, Sapienza University of Rome, 00185 Rome, Italy
- Alessia Vozzi
- Department of Anatomical, Histological, Forensic and Orthopaedic Sciences, Sapienza University, 00185 Rome, Italy
- BrainSigns srl, 00185 Rome, Italy
- Luca Tamborra
- Department of Anatomical, Histological, Forensic and Orthopaedic Sciences, Sapienza University, 00185 Rome, Italy
- People Advisory Services Department, Ernst & Young, 00187 Rome, Italy
- Ilaria Simonetti
- Department of Anatomical, Histological, Forensic and Orthopaedic Sciences, Sapienza University, 00185 Rome, Italy
- People Advisory Services Department, Ernst & Young, 00187 Rome, Italy
- Gianluca Borghini
- Department of Molecular Medicine, Sapienza University of Rome, 00185 Rome, Italy
- IRCCS Fondazione Santa Lucia, 00179 Rome, Italy
- Correspondence: V.R., G.B.; Tel.: +39-06-49910941
6. Savran A, Bartolozzi C. Face Pose Alignment with Event Cameras. Sensors (Basel) 2020; 20:7079. [PMID: 33321842] [PMCID: PMC7764104] [DOI: 10.3390/s20247079]
Abstract
The event camera (EC) is emerging as a bio-inspired sensor that can serve as an alternative or complementary vision modality, with the benefits of energy efficiency, high dynamic range, and high temporal resolution coupled with activity-dependent sparse sensing. In this study we investigate with ECs the problem of face pose alignment, an essential pre-processing stage for facial processing pipelines. EC-based alignment can unlock these benefits in facial applications, especially where motion and dynamics carry the most relevant information, since ECs sense temporal change. We specifically aim at efficient processing by developing a coarse alignment method to handle large pose variations in facial applications. For this purpose, we prepared a dataset of extreme head rotations with varying motion intensity, labeled by multiple human annotators. We propose a motion-detection-based alignment approach that generates activity-dependent pose-events, preventing unnecessary computation in the absence of pose change. The alignment is realized by cascaded regression of extremely randomized trees. Since EC sensors perform temporal differentiation, we characterize the alignment performance across different head movement speeds, face localization uncertainty ranges, face resolutions, and predictor complexities. Our method obtained a 2.7% alignment failure rate on average, whereas annotator disagreement was 1%. The promising coarse alignment performance on EC sensor data, together with a comprehensive analysis, demonstrates the potential of ECs in facial applications.
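Cascaded regression, the core fitting technique named above, can be sketched generically with scikit-learn's extremely randomized trees. This is a hedged illustration: the paper's feature extraction and pose parameterization are not specified here, so features_fn is a hypothetical placeholder.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

def fit_cascade(features_fn, images, true_poses, mean_pose, stages=3):
    """Cascaded regression: each stage's extremely randomized trees map
    features extracted at the current pose estimate to a pose update.

    features_fn(image, pose) -> 1-D feature vector (hypothetical, e.g.
    pixel comparisons around the estimate); images: list of event frames;
    true_poses: (N, D) ground-truth poses; mean_pose: (D,) initialization.
    """
    cascade = []
    poses = np.tile(mean_pose, (len(images), 1)).astype(float)
    for _ in range(stages):
        feats = np.stack([features_fn(img, p)
                          for img, p in zip(images, poses)])
        stage = ExtraTreesRegressor(n_estimators=100)
        stage.fit(feats, true_poses - poses)   # learn the residual update
        poses += stage.predict(feats)          # refine running estimates
        cascade.append(stage)
    return cascade
```

At test time the same loop runs without the fit step, applying each stored stage's predicted update in turn from the mean-pose initialization.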
Affiliation(s)
- Arman Savran
- Department of Computer Engineering, Yasar University, 35100 Izmir, Turkey
- Chiara Bartolozzi
- Event-Driven Perception for Robotics, Istituto Italiano di Tecnologia, 16163 Genova, Italy