1
Tan G, Wan Z, Wang Y, Cao Y, Zha ZJ. Tackling Event-Based Lip-Reading by Exploring Multigrained Spatiotemporal Clues. IEEE Transactions on Neural Networks and Learning Systems 2025; 36:8279-8291. [PMID: 39288038] [DOI: 10.1109/tnnls.2024.3440495]
Abstract
Automatic lip-reading (ALR) is the task of recognizing words based on visual information obtained from the speaker's lip movements. In this study, we introduce event cameras, a novel type of sensing device, for ALR. Event cameras offer both technical and application advantages over conventional cameras for ALR due to their higher temporal resolution, less redundant visual information, and lower power consumption. To recognize words from the event data, we propose a novel multigrained spatiotemporal feature learning framework, which is capable of perceiving fine-grained spatiotemporal features from microsecond time-resolved event data. Specifically, we first convert the event data into event frames of multiple temporal resolutions to avoid losing too much visual information at the event representation stage. These frames are then fed into a multibranch subnetwork, where the branch operating on low-rate frames perceives spatially complete but temporally coarse features, while the branch operating on high-rate frames perceives spatially coarse but temporally fine features. Fine-grained spatial and temporal features can thus be learned simultaneously by integrating the features perceived by the different branches. Furthermore, to model the temporal relationships in the event stream, we design a temporal aggregation subnetwork to aggregate the features perceived by the multibranch subnetwork. In addition, we collect two event-based lip-reading datasets (DVS-Lip and DVS-LRW100) for the study of the event-based lip-reading task. Experimental results demonstrate the superiority of the proposed model over state-of-the-art event-based action recognition models and video-based lip-reading models.
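The multi-temporal-resolution idea in this abstract lends itself to a short illustration. Below is a minimal Python sketch (not the authors' implementation) of accumulating an event stream into frames at a low and a high temporal resolution and fusing a low-rate and a high-rate convolutional branch; the frame counts, layer widths, and fusion by concatenation are illustrative assumptions.

# Minimal sketch (not the authors' implementation): events -> frames at two
# temporal resolutions, processed by a low-rate and a high-rate branch, fused.
import numpy as np
import torch
import torch.nn as nn

def events_to_frames(events, num_frames, height, width):
    """Accumulate (t, x, y, p) events into `num_frames` 2-channel count frames."""
    frames = np.zeros((num_frames, 2, height, width), dtype=np.float32)
    t = events[:, 0]
    bins = np.clip(((t - t.min()) / (t.max() - t.min() + 1e-9) * num_frames).astype(int),
                   0, num_frames - 1)
    for b, x, y, p in zip(bins, events[:, 1].astype(int),
                          events[:, 2].astype(int), events[:, 3].astype(int)):
        frames[b, p, y, x] += 1.0
    return torch.from_numpy(frames)

class TwoRateBranches(nn.Module):
    """Low-rate branch sees few, temporally coarse frames; high-rate branch sees many."""
    def __init__(self, feat=32):
        super().__init__()
        self.low = nn.Sequential(nn.Conv2d(2, feat, 3, padding=1), nn.ReLU(),
                                 nn.AdaptiveAvgPool2d(1))
        self.high = nn.Sequential(nn.Conv2d(2, feat, 3, padding=1), nn.ReLU(),
                                  nn.AdaptiveAvgPool2d(1))
        self.head = nn.Linear(2 * feat, 100)   # e.g. 100 word classes (assumed)

    def forward(self, low_frames, high_frames):
        f_low = self.low(low_frames).flatten(1).mean(0)    # average over frames
        f_high = self.high(high_frames).flatten(1).mean(0)
        return self.head(torch.cat([f_low, f_high]))

events = np.random.rand(5000, 4) * [1e6, 128, 128, 2]      # synthetic (t, x, y, p)
model = TwoRateBranches()
logits = model(events_to_frames(events, 4, 128, 128),       # temporally coarse
               events_to_frames(events, 32, 128, 128))      # temporally fine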
2
Zhong L, Chen Z, Wu Z, Du S, Chen Z, Wang S. Learnable Graph Convolutional Network With Semisupervised Graph Information Bottleneck. IEEE Transactions on Neural Networks and Learning Systems 2025; 36:433-446. [PMID: 37847634] [DOI: 10.1109/tnnls.2023.3322739]
Abstract
Graph convolutional networks (GCNs) have gained widespread attention in semisupervised classification tasks, and recent studies show that GCN-based methods achieve decent performance in numerous fields. However, most existing methods adopt a fixed graph that cannot dynamically capture both local and global relationships, because hidden but important relationships may not be directly exhibited in the fixed structure, degrading performance on semisupervised classification tasks. Moreover, missing and noisy data in the fixed graph may produce wrong connections, thereby disturbing the representation learning process. To cope with these issues, this article proposes a learnable GCN-based framework that aims to obtain optimal graph structures by jointly integrating graph learning and feature propagation in a unified network. In addition, to capture optimal graph representations, this article designs dual GCN-based meta-channels to simultaneously explore local and global relations during training. To minimize the interference of noisy data, a semisupervised graph information bottleneck (SGIB) is introduced to conduct graph structural learning (GSL) and acquire minimal sufficient representations. Concretely, SGIB maximizes the mutual information of both the same and different meta-channels by designing constraints between them, thereby improving node classification performance in downstream tasks. Extensive experimental results on real-world datasets demonstrate the robustness of the proposed model, which outperforms state-of-the-art methods with fixed-structure graphs.
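As a rough illustration of the dual meta-channel idea, the hedged sketch below trains two GCN channels, one on the given graph and one on a feature-similarity graph, with supervised cross-entropy plus a simple cross-channel agreement term standing in for the paper's information-bottleneck objective; the kNN construction, losses, and sizes are assumptions, not the SGIB formulation.

# Hedged sketch: two GCN "meta-channels" (given graph vs. feature-similarity graph)
# trained with cross-entropy plus a simple agreement term as a stand-in for the
# paper's mutual-information objective. All sizes and losses are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

def normalize_adj(adj):
    """Symmetric-style normalization D^-1/2 (A + I) D^-1/2."""
    adj = adj + torch.eye(adj.size(0))
    d_inv_sqrt = adj.sum(1).pow(-0.5)
    return d_inv_sqrt.unsqueeze(1) * adj * d_inv_sqrt.unsqueeze(0)

class GCN(nn.Module):
    def __init__(self, in_dim, hid, out_dim):
        super().__init__()
        self.w1, self.w2 = nn.Linear(in_dim, hid), nn.Linear(hid, out_dim)

    def forward(self, x, adj_hat):
        h = F.relu(adj_hat @ self.w1(x))
        return adj_hat @ self.w2(h)

n, d, c = 200, 16, 4
x = torch.randn(n, d)
adj_local = (torch.rand(n, n) < 0.02).float()            # given (fixed) graph
sim = F.normalize(x, dim=1) @ F.normalize(x, dim=1).T    # feature-similarity graph
adj_global = (sim > sim.topk(10, dim=1).values[:, -1:]).float()
labels = torch.randint(0, c, (n,))
train_mask = torch.rand(n) < 0.1

ch1, ch2 = GCN(d, 32, c), GCN(d, 32, c)
opt = torch.optim.Adam(list(ch1.parameters()) + list(ch2.parameters()), lr=1e-2)
for _ in range(50):
    z1 = ch1(x, normalize_adj(adj_local))
    z2 = ch2(x, normalize_adj(adj_global))
    sup = F.cross_entropy(z1[train_mask], labels[train_mask]) \
        + F.cross_entropy(z2[train_mask], labels[train_mask])
    agree = F.mse_loss(F.softmax(z1, 1), F.softmax(z2, 1))  # cross-channel consistency
    opt.zero_grad(); (sup + agree).backward(); opt.step()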
3
Tang S, Zhao Y, Lv H, Sun M, Feng Y, Zhang Z. Adaptive Optimization and Dynamic Representation Method for Asynchronous Data Based on Regional Correlation Degree. Sensors (Basel) 2024; 24:7430. [PMID: 39685963] [DOI: 10.3390/s24237430]
Abstract
Event cameras, as bio-inspired visual sensors, offer significant advantages in their high dynamic range and high temporal resolution for visual tasks. These capabilities enable efficient and reliable motion estimation even in the most complex scenes. However, these advantages come with certain trade-offs. For instance, current event-based vision sensors have low spatial resolution, and the process of event representation can result in varying degrees of data redundancy and incompleteness. Additionally, because of the inherent characteristics of event-stream data, the stream cannot be used directly; pre-processing steps such as slicing and frame compression are required. Various pre-processing algorithms exist for slicing and compressing event streams, but they fall short when multiple subjects move at different and varying speeds within the event stream, potentially exacerbating the inherent deficiencies of the event information flow. To address this longstanding issue, we propose a novel and efficient Asynchronous Spike Dynamic Metric and Slicing algorithm (ASDMS). ASDMS adaptively segments the event stream into fragments of varying lengths based on the spatiotemporal structure and polarity attributes of the events. Moreover, we introduce a new Adaptive Spatiotemporal Subject Surface Compensation algorithm (ASSSC). ASSSC compensates for missing motion information in the event stream and removes redundant information, thereby achieving better performance and effectiveness in event-stream segmentation than existing event representation algorithms. Additionally, after compressing the processed results into frame images, the imaging quality is significantly improved. Finally, we propose a new evaluation metric, the Actual Performance Efficiency Discrepancy (APED), which combines the actual distortion rate and event information entropy to quantify and compare the effectiveness of our method against other existing event representation methods. The final experimental results demonstrate that our event representation method outperforms existing approaches and addresses the shortcomings of current methods in handling event streams with multiple entities moving at varying speeds simultaneously.
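The abstract's adaptive-slicing idea can be illustrated with a hedged sketch that is deliberately much simpler than ASDMS/ASSSC: close a slice once enough activity has accumulated (rather than after a fixed duration), so fast-moving subjects yield shorter slices, then compress each slice into a signed polarity frame. Thresholds, sensor size, and the count-based criterion are illustrative assumptions.

# Hedged illustration of adaptive slicing (not the ASDMS algorithm itself):
# instead of fixed-duration windows, close a slice once enough activity has
# accumulated, so fast-moving subjects get shorter slices than slow ones.
import numpy as np

def adaptive_slices(events, activity_threshold=2000, max_duration=50_000):
    """events: array of (t_us, x, y, p) sorted by time. Returns a list of slices."""
    slices, start = [], 0
    for i in range(1, len(events)):
        long_enough = events[i, 0] - events[start, 0] >= max_duration
        busy_enough = i - start >= activity_threshold
        if busy_enough or long_enough:
            slices.append(events[start:i])
            start = i
    if start < len(events):
        slices.append(events[start:])
    return slices

def slice_to_frame(ev, height=260, width=346):
    """Compress one slice into a signed polarity-accumulation frame."""
    frame = np.zeros((height, width), dtype=np.float32)
    np.add.at(frame, (ev[:, 2].astype(int), ev[:, 1].astype(int)),
              2.0 * ev[:, 3] - 1.0)            # +1 for ON events, -1 for OFF
    return frame

t = np.sort(np.random.randint(0, 1_000_000, 100_000))       # synthetic timestamps (us)
events = np.column_stack([t, np.random.randint(0, 346, 100_000),
                          np.random.randint(0, 260, 100_000),
                          np.random.randint(0, 2, 100_000)])
frames = [slice_to_frame(s) for s in adaptive_slices(events)]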
Affiliation(s)
- Sichao Tang
- Changchun Institute of Optics, Fine Mechanics and Physics, Chinese Academy of Sciences, Changchun 130033, China
- University of Chinese Academy of Sciences, Beijing 100049, China
- Yuchen Zhao
- Changchun Institute of Optics, Fine Mechanics and Physics, Chinese Academy of Sciences, Changchun 130033, China
- Hengyi Lv
- Changchun Institute of Optics, Fine Mechanics and Physics, Chinese Academy of Sciences, Changchun 130033, China
- Ming Sun
- Changchun Institute of Optics, Fine Mechanics and Physics, Chinese Academy of Sciences, Changchun 130033, China
- Yang Feng
- Changchun Institute of Optics, Fine Mechanics and Physics, Chinese Academy of Sciences, Changchun 130033, China
- Zeshu Zhang
- Changchun Institute of Optics, Fine Mechanics and Physics, Chinese Academy of Sciences, Changchun 130033, China
4
Sabater A, Montesano L, Murillo AC. Event Transformer+. A Multi-Purpose Solution for Efficient Event Data Processing. IEEE Transactions on Pattern Analysis and Machine Intelligence 2023; 45:16013-16020. [PMID: 37656643] [DOI: 10.1109/tpami.2023.3311336]
Abstract
Event cameras record sparse illumination changes with high temporal resolution and high dynamic range. Thanks to their sparse recording and low power consumption, they are increasingly used in applications such as AR/VR and autonomous driving. Current top-performing methods often ignore specific event-data properties, leading to the development of generic but computationally expensive algorithms, while event-aware methods do not perform as well. We propose Event Transformer+, which improves our seminal work EvT with a refined patch-based event representation and a more robust backbone to achieve more accurate results, while still benefiting from event-data sparsity to increase its efficiency. Additionally, we show how our system can work with different data modalities and propose specific output heads for event-stream classification (i.e., action recognition) and per-pixel prediction (dense depth estimation). Evaluation results show better performance than the state of the art while requiring minimal computation resources, both on GPU and CPU.
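A hedged sketch of the general patch-based sparse representation idea follows (not the EvT+/Event Transformer+ code): build a two-channel event histogram, split it into patches, and keep only the active patches as tokens together with their positions so downstream attention can skip empty regions. The patch size and activity threshold are assumed values.

# Hedged sketch of a patch-based sparse event representation: only patches that
# contain enough events become tokens. Patch size and threshold are illustrative.
import numpy as np

def sparse_event_patches(events, height=128, width=128, patch=16, min_events=8):
    hist = np.zeros((2, height, width), dtype=np.float32)
    np.add.at(hist, (events[:, 3].astype(int),
                     events[:, 2].astype(int),
                     events[:, 1].astype(int)), 1.0)
    tokens, positions = [], []
    for py in range(0, height, patch):
        for px in range(0, width, patch):
            block = hist[:, py:py + patch, px:px + patch]
            if block.sum() >= min_events:                  # skip near-empty patches
                tokens.append(block.reshape(-1))           # flattened patch token
                positions.append((py // patch, px // patch))
    return np.stack(tokens), np.array(positions)

events = np.column_stack([np.sort(np.random.randint(0, 10_000, 3000)),   # t
                          np.random.randint(0, 128, 3000),               # x
                          np.random.randint(0, 128, 3000),               # y
                          np.random.randint(0, 2, 3000)])                # polarity
tokens, positions = sparse_event_patches(events)
print(tokens.shape, positions.shape)   # (num_active_patches, 2*16*16), (num_active_patches, 2)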
5
Yao M, Zhang H, Zhao G, Zhang X, Wang D, Cao G, Li G. Sparser spiking activity can be better: Feature Refine-and-Mask spiking neural network for event-based visual recognition. Neural Networks 2023; 166:410-423. [PMID: 37549609] [DOI: 10.1016/j.neunet.2023.07.008]
Abstract
Event-based vision, a new visual paradigm with bio-inspired dynamic perception and μs-level temporal resolution, has prominent advantages in many specific visual scenarios and has gained much research interest. Spiking neural networks (SNNs) are naturally suitable for dealing with event streams due to their temporal information processing capability and event-driven nature. However, existing SNN works neglect the fact that input event streams are spatially sparse and temporally non-uniform, and simply treat these varying inputs equally. This interferes with the effectiveness and efficiency of existing SNNs. In this paper, we propose the feature Refine-and-Mask SNN (RM-SNN), which has the ability to self-adapt and regulate the spiking response in a data-dependent way. We use the Refine-and-Mask (RM) module to refine all features and mask the unimportant ones to optimize the membrane potential of spiking neurons, which in turn reduces spiking activity. Inspired by the fact that not all events in spatio-temporal streams are task-relevant, we apply the RM module in both the temporal and channel dimensions. Extensive experiments on seven event-based benchmarks, DVS128 Gesture, DVS128 Gait, CIFAR10-DVS, N-Caltech101, DailyAction-DVS, UCF101-DVS, and HMDB51-DVS, demonstrate that under multi-scale constraints on the input time window, RM-SNN can significantly reduce the average network spiking activity rate while improving task performance. In addition, by visualizing spiking responses, we analyze why sparser spiking activity can be better.
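The refine-and-mask idea can be sketched, in simplified form, as channel-wise attention that rescales features and zeroes out the lowest-scoring channels before a leaky integrate-and-fire update, so unimportant channels contribute no spikes. The sketch below is an assumption-laden illustration, not the RM-SNN code; the keep ratio, threshold, and decay are arbitrary.

# Hedged, simplified sketch of a "refine then mask" step in front of a LIF layer:
# squeeze features per channel, score them, rescale (refine) all channels, and
# zero out (mask) the lowest-scoring ones so they add nothing to the membrane.
import torch
import torch.nn as nn

class RefineAndMaskLIF(nn.Module):
    def __init__(self, channels, keep_ratio=0.5, v_threshold=1.0, decay=0.5):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(channels, channels), nn.Sigmoid())
        self.keep = int(channels * keep_ratio)
        self.v_threshold, self.decay = v_threshold, decay

    def forward(self, x_seq):                      # x_seq: (T, C, H, W)
        T, C, H, W = x_seq.shape
        v = torch.zeros(C, H, W)
        spikes = []
        for t in range(T):
            s = self.score(x_seq[t].mean(dim=(1, 2)))        # channel scores
            mask = torch.zeros_like(s)
            mask[s.topk(self.keep).indices] = 1.0            # keep top channels only
            refined = x_seq[t] * (s * mask).view(C, 1, 1)    # refine + mask
            v = self.decay * v + refined                     # membrane update
            out = (v >= self.v_threshold).float()            # fire
            v = v * (1.0 - out)                              # hard reset
            spikes.append(out)
        return torch.stack(spikes)

layer = RefineAndMaskLIF(channels=32)
spike_out = layer(torch.rand(10, 32, 24, 24))      # 10 timesteps of event features
print(spike_out.mean())                            # average spiking activity rate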
Affiliation(s)
- Man Yao
- School of Automation Science and Engineering, Xi'an Jiaotong University, Xi'an, Shaanxi 710049, China; Peng Cheng Laboratory, Shenzhen 518000, China.
- Hengyu Zhang
- School of Automation Science and Engineering, Xi'an Jiaotong University, Xi'an, Shaanxi 710049, China; Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen 518000, China.
- Guangshe Zhao
- School of Automation Science and Engineering, Xi'an Jiaotong University, Xi'an, Shaanxi 710049, China.
- Xiyu Zhang
- School of Automation Science and Engineering, Xi'an Jiaotong University, Xi'an, Shaanxi 710049, China.
- Dingheng Wang
- Northwest Institute of Mechanical & Electrical Engineering, Xianyang, Shaanxi, China.
- Gang Cao
- Beijing Academy of Artificial Intelligence, Beijing 100089, China.
- Guoqi Li
- Peng Cheng Laboratory, Shenzhen 518000, China; Institute of Automation, Chinese Academy of Sciences, Beijing 100089, China.
6
Wu Z, Shen Y, Zhang J, Liang H, Zhao R, Li H, Xiong J, Zhang X, Chua Y. BIDL: a brain-inspired deep learning framework for spatiotemporal processing. Frontiers in Neuroscience 2023; 17:1213720. [PMID: 37564366] [PMCID: PMC10410154] [DOI: 10.3389/fnins.2023.1213720]
Abstract
The brain-inspired deep spiking neural network (DSNN), which emulates the function of the biological brain, provides an effective approach for event-stream spatiotemporal perception (STP), especially for dynamic vision sensor (DVS) signals. However, there is a lack of generalized learning frameworks that can handle various spatiotemporal modalities beyond event streams, such as video clips and 3D imaging data. To provide a unified design flow for generalized STP and to investigate the capability of lightweight STP via brain-inspired neural dynamics, this study introduces a training platform called brain-inspired deep learning (BIDL). This framework constructs deep neural networks that leverage neural dynamics to process temporal information and artificial neural network layers to ensure high-accuracy spatial processing. We conducted experiments involving various types of data, including video information processing, DVS information processing, 3D medical imaging classification, and natural language processing, which demonstrate the efficiency of the proposed method. Moreover, as a research framework for researchers in the fields of neuroscience and machine learning, BIDL facilitates the exploration of different neural models and enables global-local co-learning. To fit easily onto neuromorphic chips and GPUs, the framework incorporates several optimizations, including iteration representation, a state-aware computational graph, and built-in neural functions. This study presents a user-friendly and efficient DSNN builder for lightweight STP applications and has the potential to drive future advancements in bio-inspired research.
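The "ANN layers for space, neural dynamics for time" pattern the abstract describes can be illustrated with a minimal sketch: a conventional conv layer processes each frame while a leaky-integrator state carries information across timesteps. This shows the design idea only and is not the BIDL API; all names and sizes are assumptions.

# Hedged sketch of the general pattern: conv layers do spatial processing per
# frame, a leaky-integrator state carries information across timesteps, and a
# linear head classifies the final state. Not the BIDL API.
import torch
import torch.nn as nn

class ConvLeakyIntegrator(nn.Module):
    def __init__(self, in_ch=2, feat=16, num_classes=10, decay=0.7):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv2d(in_ch, feat, 3, padding=1),
                                  nn.ReLU(), nn.AdaptiveAvgPool2d(4))
        self.head = nn.Linear(feat * 4 * 4, num_classes)
        self.decay = decay

    def forward(self, x_seq):                      # (T, B, C, H, W)
        state = None
        for frame in x_seq:                        # iterate over time explicitly
            feat = self.conv(frame).flatten(1)     # spatial processing per frame
            state = feat if state is None else self.decay * state + feat
        return self.head(state)

model = ConvLeakyIntegrator()
logits = model(torch.rand(8, 4, 2, 64, 64))        # 8 timesteps, batch of 4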
Affiliation(s)
- Zhenzhi Wu
- Lynxi Technologies, Co. Ltd., Beijing, China
- Yangshu Shen
- Lynxi Technologies, Co. Ltd., Beijing, China
- Department of Precision Instruments and Mechanology, Tsinghua University, Beijing, China
- Jing Zhang
- Lynxi Technologies, Co. Ltd., Beijing, China
- Huaju Liang
- Neuromorphic Computing Laboratory, China Nanhu Academy of Electronics and Information Technology (CNAEIT), Jiaxing, Zhejiang, China
- Han Li
- Lynxi Technologies, Co. Ltd., Beijing, China
- Jianping Xiong
- Department of Precision Instruments and Mechanology, Tsinghua University, Beijing, China
- Xiyu Zhang
- School of Automation Science and Engineering, Xi'an Jiaotong University, Xi'an, Shaanxi, China
- Yansong Chua
- Neuromorphic Computing Laboratory, China Nanhu Academy of Electronics and Information Technology (CNAEIT), Jiaxing, Zhejiang, China
7
Sun Z, Ke Q, Rahmani H, Bennamoun M, Wang G, Liu J. Human Action Recognition From Various Data Modalities: A Review. IEEE Transactions on Pattern Analysis and Machine Intelligence 2023; 45:3200-3225. [PMID: 35700242] [DOI: 10.1109/tpami.2022.3183112]
Abstract
Human action recognition (HAR) aims to understand human behavior and assign a label to each action. It has a wide range of applications and has therefore been attracting increasing attention in the field of computer vision. Human actions can be represented using various data modalities, such as RGB, skeleton, depth, infrared, point cloud, event stream, audio, acceleration, radar, and WiFi signals, which encode different sources of useful yet distinct information and have various advantages depending on the application scenario. Consequently, many existing works have investigated different types of approaches for HAR using various modalities. In this article, we present a comprehensive survey of recent progress in deep learning methods for HAR based on the type of input data modality. Specifically, we review the current mainstream deep learning methods for single and multiple data modalities, including the fusion-based and co-learning-based frameworks. We also present comparative results on several benchmark datasets for HAR, together with insightful observations and inspiring future research directions.
8
Baldwin RW, Liu R, Almatrafi M, Asari V, Hirakawa K. Time-Ordered Recent Event (TORE) Volumes for Event Cameras. IEEE Transactions on Pattern Analysis and Machine Intelligence 2023; 45:2519-2532. [PMID: 35503820] [DOI: 10.1109/tpami.2022.3172212]
Abstract
Event cameras are an exciting new sensor modality enabling high-speed imaging with extremely low latency and wide dynamic range. Unfortunately, most machine learning architectures are not designed to directly handle sparse data like that generated by event cameras. Many state-of-the-art algorithms for event cameras rely on interpolated event representations, obscuring crucial timing information, increasing the data volume, and limiting overall network performance. This paper details an event representation called Time-Ordered Recent Event (TORE) volumes. TORE volumes are designed to compactly store raw spike timing information with minimal information loss. This bio-inspired design is memory efficient, computationally fast, avoids time-blocking (i.e., fixed and predefined frame rates), and contains "local memory" from past data. The design is evaluated on a wide range of challenging tasks (e.g., event denoising, image reconstruction, classification, and human pose estimation) and is shown to dramatically improve state-of-the-art performance. TORE volumes are an easy-to-implement replacement for any algorithm currently utilizing event representations.
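A hedged sketch of the core idea as stated in the abstract: for each pixel and polarity, keep the K most recent event timestamps and encode each stored event by the (log) time elapsed at query time. The buffer depth and log encoding below are illustrative assumptions rather than the paper's exact definition.

# Hedged sketch of a TORE-like volume: per pixel and polarity, store the K most
# recent event times, then encode each by the log of its age at query time.
# K and the log encoding are assumptions, not the paper's exact definition.
import numpy as np

def tore_like_volume(events, t_query, height=180, width=240, k=4):
    """events: (t_us, x, y, p), time-sorted. Returns a (2*k, H, W) volume."""
    recent = np.full((2, k, height, width), -np.inf)      # newest-first buffers
    for t, x, y, p in events:
        x, y, p = int(x), int(y), int(p)
        recent[p, 1:, y, x] = recent[p, :-1, y, x].copy()  # shift older events down
        recent[p, 0, y, x] = t                             # store newest timestamp
    age = np.clip(t_query - recent, 1.0, None)             # microseconds since event
    volume = np.log(age)
    volume[np.isinf(recent)] = 0.0                         # pixels with no history
    return volume.reshape(2 * k, height, width)

events = np.column_stack([np.sort(np.random.randint(0, 100_000, 20_000)),
                          np.random.randint(0, 240, 20_000),
                          np.random.randint(0, 180, 20_000),
                          np.random.randint(0, 2, 20_000)]).astype(float)
vol = tore_like_volume(events, t_query=100_000)
print(vol.shape)    # (8, 180, 240)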
9
Chakraborty B, Mukhopadhyay S. Heterogeneous recurrent spiking neural network for spatio-temporal classification. Frontiers in Neuroscience 2023; 17:994517. [PMID: 36793542] [PMCID: PMC9922697] [DOI: 10.3389/fnins.2023.994517]
Abstract
Spiking neural networks (SNNs) are often touted as brain-inspired learning models for the third wave of artificial intelligence. Although recent SNNs trained with supervised backpropagation show classification accuracy comparable to deep networks, the performance of unsupervised learning-based SNNs remains much lower. This paper presents a heterogeneous recurrent spiking neural network (HRSNN) with unsupervised learning for spatio-temporal classification of video activity recognition tasks on RGB (KTH, UCF11, UCF101) and event-based datasets (DVS128 Gesture). With the novel unsupervised HRSNN model, we observed an accuracy of 94.32% on the KTH dataset, 79.58% and 77.53% on the UCF11 and UCF101 datasets, respectively, and 96.54% on the event-based DVS Gesture dataset. The key novelty of the HRSNN is that its recurrent layer consists of heterogeneous neurons with varying firing/relaxation dynamics, trained via heterogeneous spike-time-dependent plasticity (STDP) with varying learning dynamics for each synapse. We show that this combination of heterogeneity in architecture and learning method outperforms current homogeneous spiking neural networks. We further show that HRSNN can achieve performance similar to state-of-the-art backpropagation-trained supervised SNNs, but with less computation (fewer neurons and sparse connections) and less training data.
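Only the heterogeneity aspect is easy to sketch compactly: a recurrent reservoir of LIF neurons whose membrane time constants are sampled per neuron, with time-averaged firing rates as features for a separate readout. The heterogeneous STDP training central to the paper is not reproduced here; weights are random and fixed, and all sizes are assumptions.

# Hedged sketch of the heterogeneity idea only: a recurrent LIF reservoir with
# per-neuron time constants. The paper's heterogeneous STDP is not reproduced.
import numpy as np

rng = np.random.default_rng(0)
n_in, n_res, n_steps = 64, 200, 100
w_in = rng.normal(0, 0.5, (n_res, n_in))
w_rec = rng.normal(0, 0.1, (n_res, n_res))
tau = rng.uniform(5.0, 50.0, n_res)             # heterogeneous time constants (ms)
decay = np.exp(-1.0 / tau)                      # per-neuron leak over a 1 ms step
v_threshold = 1.0

def run_reservoir(inputs):                      # inputs: (n_steps, n_in) spike trains
    v = np.zeros(n_res)
    spikes = np.zeros((n_steps, n_res))
    last_spike = np.zeros(n_res)
    for t in range(n_steps):
        v = decay * v + w_in @ inputs[t] + w_rec @ last_spike
        last_spike = (v >= v_threshold).astype(float)
        v = np.where(last_spike > 0, 0.0, v)    # reset fired neurons
        spikes[t] = last_spike
    return spikes.mean(axis=0)                  # per-neuron firing rates as features

inputs = (rng.random((n_steps, n_in)) < 0.05).astype(float)   # Poisson-like input
features = run_reservoir(inputs)                # would feed a trained linear readout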
10
Dong J, Jiang R, Xiao R, Yan R, Tang H. Event stream learning using spatio-temporal event surface. Neural Networks 2022; 154:543-559. [DOI: 10.1016/j.neunet.2022.07.010]
11
Wang Y, Zhang X, Shen Y, Du B, Zhao G, Cui L, Wen H. Event-Stream Representation for Human Gaits Identification Using Deep Neural Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 2022; 44:3436-3449. [PMID: 33502972] [DOI: 10.1109/tpami.2021.3054886]
Abstract
Dynamic vision sensors (event cameras) have recently been introduced to solve a number of different vision tasks such as object recognition, activity recognition, and tracking. Compared with traditional RGB sensors, event cameras have many unique advantages such as ultra-low resource consumption, high temporal resolution, and a much larger dynamic range. However, these cameras only produce noisy and asynchronous events of intensity changes, i.e., event streams rather than frames, to which conventional computer vision algorithms cannot be directly applied. In our opinion, the key challenge for improving the performance of event cameras in vision tasks is finding appropriate representations of the event streams so that cutting-edge learning approaches can be applied to fully uncover the spatio-temporal information they contain. In this paper, we focus on the event-based human gait identification task and investigate possible representations of the event streams when deep neural networks are applied as the classifier. We propose new event-based gait recognition approaches based on two different representations of the event stream, i.e., graph and image-like representations, and use a graph convolutional network (GCN) and a convolutional neural network (CNN), respectively, to recognize gait from the event streams. The two approaches are termed EV-Gait-3DGraph and EV-Gait-IMG. To evaluate the performance of the proposed approaches, we collect two event-based gait datasets, one from real-world experiments and the other by converting the publicly available RGB gait recognition benchmark CASIA-B. Extensive experiments show that EV-Gait-3DGraph achieves significantly higher recognition accuracy than other competing methods when sufficient training samples are available, whereas EV-Gait-IMG converges more quickly during training and shows good accuracy with only a few training samples (fewer than ten), so the image-like representation is preferable when training data are limited.
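An image-like event representation of the general kind mentioned in the abstract can be sketched as per-pixel event counts plus normalized most-recent timestamps for each polarity, stacked as a four-channel image a standard CNN can consume. The channel design below is an assumption, not the EV-Gait-IMG definition.

# Hedged sketch of an image-like event representation: ON/OFF event counts plus
# normalized latest timestamps per polarity, stacked as a 4-channel image.
import numpy as np

def events_to_image(events, height=128, width=128):
    """events: (t, x, y, p) array. Returns a (4, H, W) float image."""
    img = np.zeros((4, height, width), dtype=np.float32)
    t = events[:, 0].astype(np.float64)
    t_norm = (t - t.min()) / (t.max() - t.min() + 1e-9)
    for tn, x, y, p in zip(t_norm, events[:, 1].astype(int),
                           events[:, 2].astype(int), events[:, 3].astype(int)):
        img[p, y, x] += 1.0                           # channels 0-1: ON/OFF counts
        img[2 + p, y, x] = max(img[2 + p, y, x], tn)  # channels 2-3: latest timestamp
    img[:2] /= max(img[:2].max(), 1.0)                # scale counts to [0, 1]
    return img

events = np.column_stack([np.sort(np.random.randint(0, 1_000_000, 50_000)),
                          np.random.randint(0, 128, 50_000),
                          np.random.randint(0, 128, 50_000),
                          np.random.randint(0, 2, 50_000)])
image = events_to_image(events)     # ready for a standard 2-D CNN classifier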
12
Wan X, Cen L, Chen X, Xie Y. A novel multiple temporal-spatial convolution network for anode current signals classification. International Journal of Machine Learning and Cybernetics 2022. [DOI: 10.1007/s13042-022-01595-7]
13
Ralph N, Joubert D, Jolley A, Afshar S, Tothill N, van Schaik A, Cohen G. Real-Time Event-Based Unsupervised Feature Consolidation and Tracking for Space Situational Awareness. Frontiers in Neuroscience 2022; 16:821157. [PMID: 35600627] [PMCID: PMC9120364] [DOI: 10.3389/fnins.2022.821157]
Abstract
Earth orbit is a limited natural resource that hosts a vast range of vital space-based systems supporting the international community's national, commercial, and defence interests. This resource is rapidly becoming depleted, with over-crowding in high-demand orbital slots and a growing presence of space debris. We propose the Fast Iterative Extraction of Salient targets for Tracking Asynchronously (FIESTA) algorithm as a robust, real-time, and reactive approach to optical Space Situational Awareness (SSA) using Event-Based Cameras (EBCs) to detect, localize, and track Resident Space Objects (RSOs) accurately and in a timely manner. We address the challenges posed by the asynchronous nature and high temporal resolution of EBC output accurately, without supervision, and with few tunable parameters, using concepts established in the neuromorphic and conventional tracking literature. We show that this algorithm is capable of highly accurate in-frame RSO velocity estimation and average sub-pixel localization in a simulated test environment designed to distinguish the capabilities of the EBC and optical setup from those of the proposed tracking system. This work is a fundamental step toward accurate end-to-end real-time optical event-based SSA and lays the foundation for robust closed-form tracking evaluated using standardized tracking metrics.
Affiliation(s)
- Nicholas Ralph
- International Centre for Neuromorphic Engineering, MARCS Institute for Brain Behaviour and Development, Western Sydney University, Werrington, NSW, Australia
- *Correspondence: Nicholas Ralph
- Damien Joubert
- International Centre for Neuromorphic Engineering, MARCS Institute for Brain Behaviour and Development, Western Sydney University, Werrington, NSW, Australia
- Andrew Jolley
- International Centre for Neuromorphic Engineering, MARCS Institute for Brain Behaviour and Development, Western Sydney University, Werrington, NSW, Australia
- Air and Space Power Development Centre, Royal Australian Air Force, Canberra, ACT, Australia
- Saeed Afshar
- International Centre for Neuromorphic Engineering, MARCS Institute for Brain Behaviour and Development, Western Sydney University, Werrington, NSW, Australia
- Nicholas Tothill
- International Centre for Neuromorphic Engineering, MARCS Institute for Brain Behaviour and Development, Western Sydney University, Werrington, NSW, Australia
- André van Schaik
- International Centre for Neuromorphic Engineering, MARCS Institute for Brain Behaviour and Development, Western Sydney University, Werrington, NSW, Australia
- Gregory Cohen
- International Centre for Neuromorphic Engineering, MARCS Institute for Brain Behaviour and Development, Western Sydney University, Werrington, NSW, Australia
14
Annamalai L, Ramanathan V, Thakur CS. Event-LSTM: An Unsupervised and Asynchronous Learning-Based Representation for Event-Based Data. IEEE Robotics and Automation Letters 2022. [DOI: 10.1109/lra.2022.3151426]
15
Xie B, Deng Y, Shao Z, Liu H, Li Y. VMV-GCN: Volumetric Multi-View Based Graph CNN for Event Stream Classification. IEEE Robotics and Automation Letters 2022. [DOI: 10.1109/lra.2022.3140819]
16
Zhang Z, Han X, Song X, Yan Y, Nie L. Multi-Modal Interaction Graph Convolutional Network for Temporal Language Localization in Videos. IEEE Transactions on Image Processing 2021; 30:8265-8277. [PMID: 34559652] [DOI: 10.1109/tip.2021.3113791]
Abstract
This paper focuses on tackling the problem of temporal language localization in videos, which aims to identify the start and end points of a moment described by a natural language sentence in an untrimmed video. This is non-trivial since it requires not only a comprehensive understanding of the video and the sentence query, but also accurately capturing the semantic correspondence between them. Existing efforts are mainly centered on exploring the sequential relation among video clips and query words to reason about the video and sentence query, neglecting other intra-modal relations (e.g., semantic similarity among video clips and syntactic dependency among the query words). Towards this end, in this work, we propose a Multi-modal Interaction Graph Convolutional Network (MIGCN), which jointly explores the complex intra-modal relations and inter-modal interactions residing in the video and sentence query to facilitate understanding and semantic correspondence capture between them. In addition, we devise an adaptive context-aware localization method, where context information is incorporated into the candidate moments and multi-scale fully connected layers are designed to rank and adjust the boundaries of the generated coarse candidate moments of different lengths. Extensive experiments on the Charades-STA and ActivityNet datasets demonstrate the promising performance and superior efficiency of our model.
17
Deng Y, Chen H, Chen H, Li Y. Learning From Images: A Distillation Learning Framework for Event Cameras. IEEE Transactions on Image Processing 2021; 30:4919-4931. [PMID: 33961557] [DOI: 10.1109/tip.2021.3077136]
Abstract
Event cameras have recently drawn massive attention in the computer vision community because of their low power consumption and high response speed. These cameras produce sparse and non-uniform spatiotemporal representations of a scene, which makes it difficult for event-based models to extract discriminative cues (such as textures and geometric relationships). Consequently, event-based methods usually perform poorly compared to their conventional image counterparts. Considering that traditional images and event signals share considerable visual information, this paper aims to improve the feature extraction ability of event-based models by using knowledge distilled from the image domain to provide additional explicit feature-level supervision for the learning of event data. Specifically, we propose a simple yet effective distillation learning framework, including multi-level customized knowledge distillation constraints. Our framework can significantly boost the feature extraction process for event data and is applicable to various downstream tasks. We evaluate our framework on high-level and low-level tasks, i.e., object classification and optical flow prediction. Experimental results show that our framework can effectively improve the performance of event-based models on both tasks by a large margin. Furthermore, we present a 10K-sample dataset (CEP-DVS) for event-based object classification. This dataset consists of samples recorded under random motion trajectories, which can better evaluate the motion robustness of event-based models, and it is compatible with multi-modality vision tasks.
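Feature-level distillation from an image teacher to an event student can be sketched at a single feature level: match the student's intermediate features to a frozen teacher's through a small projector, alongside the ordinary task loss. The paper's multi-level customized constraints are not reproduced; the architectures, projector, and loss weight below are assumptions.

# Hedged sketch of one level of image-to-event feature distillation: the frozen
# image teacher supervises the event student's features via a 1x1 projector,
# combined with the usual task loss. Architectures and weights are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                        nn.AdaptiveAvgPool2d(8))            # stand-in image model (frozen)
student = nn.Sequential(nn.Conv2d(2, 32, 3, padding=1), nn.ReLU(),
                        nn.AdaptiveAvgPool2d(8))            # event model being trained
projector = nn.Conv2d(32, 64, 1)                            # aligns feature dimensions
head = nn.Linear(32 * 8 * 8, 10)

for p in teacher.parameters():
    p.requires_grad_(False)
opt = torch.optim.Adam(list(student.parameters()) +
                       list(projector.parameters()) + list(head.parameters()), lr=1e-3)

images = torch.rand(4, 3, 64, 64)        # paired image frames
event_frames = torch.rand(4, 2, 64, 64)  # paired event representations
labels = torch.randint(0, 10, (4,))

f_student = student(event_frames)
with torch.no_grad():
    f_teacher = teacher(images)
distill = F.mse_loss(projector(f_student), f_teacher)       # feature-level supervision
task = F.cross_entropy(head(f_student.flatten(1)), labels)  # ordinary task loss
opt.zero_grad(); (task + 0.5 * distill).backward(); opt.step()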