1. Zhang J, Zhang M, Wang Y, Liu Q, Yin B, Li H, Yang X. Spiking Neural Networks with Adaptive Membrane Time Constant for Event-Based Tracking. IEEE Transactions on Image Processing 2025; PP:1009-1021. PMID: 40031251. DOI: 10.1109/tip.2025.3533213.
Abstract
Brain-inspired Spiking Neural Networks (SNNs) operate in an event-driven manner and carry an implicit recurrence in the neuronal membrane potential that memorizes information over time, making them inherently suitable for handling temporal event-based streams. Despite this temporal nature and recent advances, these methods have predominantly been assessed on event-based classification tasks. In this paper, we explore the utility of SNNs for event-based tracking. Specifically, we propose a brain-inspired adaptive Leaky Integrate-and-Fire neuron (BA-LIF) that adaptively adjusts its membrane time constant according to the inputs, thereby accelerating the leakage of meaningless noise features while reducing the decay of valuable information. SNNs composed of the proposed BA-LIF neurons achieve high performance without careful, time-consuming trial-and-error initialization of the membrane time constant. The adaptive capability of the network is further improved by an additional temporal feature aggregator (TFA) that assigns attention weights over the temporal dimension. Extensive experiments on various event-based tracking datasets validate the effectiveness of the proposed method, and we further validate its generalization capability on event-based classification tasks.
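The sketch below illustrates the core mechanism described above: a leaky integrate-and-fire neuron whose decay factor (the counterpart of the membrane time constant) is predicted from the input, so noisy inputs can leak faster while informative ones persist. It is a minimal PyTorch illustration under assumed design choices (a linear gating layer, a hard threshold without a surrogate gradient, soft reset), not the authors' BA-LIF implementation.

```python
# Minimal sketch of an input-adaptive LIF neuron (assumptions: linear gate,
# hard threshold without surrogate gradient, soft reset).
import torch
import torch.nn as nn

class AdaptiveLIF(nn.Module):
    def __init__(self, features, v_threshold=1.0):
        super().__init__()
        self.v_threshold = v_threshold
        # maps the input current to a per-neuron decay factor in (0, 1)
        self.tau_gate = nn.Linear(features, features)

    def forward(self, x_seq):              # x_seq: (T, B, features)
        v = torch.zeros_like(x_seq[0])     # membrane potential
        spikes = []
        for x in x_seq:
            beta = torch.sigmoid(self.tau_gate(x))  # input-dependent decay factor
            v = beta * v + x                        # leaky integration
            s = (v >= self.v_threshold).float()     # fire (no surrogate gradient here)
            v = v - s * self.v_threshold            # soft reset
            spikes.append(s)
        return torch.stack(spikes)

if __name__ == "__main__":
    neuron = AdaptiveLIF(features=8)
    out = neuron(torch.randn(5, 2, 8))     # 5 time steps, batch of 2
    print(out.shape)                       # torch.Size([5, 2, 8])
```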
2. Qin Z, Lu X, Liu D, Nie X, Yin Y, Shen J, Loui AC. Reformulating Graph Kernels for Self-Supervised Space-Time Correspondence Learning. IEEE Transactions on Image Processing 2023; 32:6543-6557. PMID: 37922168. DOI: 10.1109/tip.2023.3328485.
Abstract
Self-supervised space-time correspondence learning from unlabeled videos holds great potential in computer vision. Most existing methods rely on contrastive learning with negative-sample mining or on reconstruction adapted from the image domain, which requires dense affinities across multiple frames or optical-flow constraints. Moreover, video correspondence models need to uncover more of the video's inherent properties, such as structural information. In this work, we propose HiGraph+, a space-time correspondence framework based on learnable graph kernels. By treating a video as a spatial-temporal graph, the learning objective of HiGraph+ is formulated in a self-supervised manner: predicting the unobserved hidden graph via graph kernel methods. First, we learn the structural consistency of sub-graphs for graph-level correspondence. Furthermore, we introduce a spatio-temporal hidden-graph loss through contrastive learning that encourages temporal coherence of sub-graphs across frames and spatial diversity within the same frame. This allows us to predict long-term correspondences and drives the hidden graph to acquire distinct local structural representations. Then, we learn a refined representation across frames at the node level via a dense graph kernel. The structural and temporal consistency of the graph forms the self-supervision for model training. HiGraph+ achieves excellent performance and demonstrates robustness on benchmarks involving object, semantic-part, keypoint, and instance label propagation. Our implementation is publicly available at https://github.com/zyqin19/HiGraph.
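Space-time correspondence methods of this kind are ultimately evaluated by propagating labels through cross-frame affinities. The sketch below shows that basic propagation step in PyTorch; the feature dimensions, temperature, and cosine-similarity affinity are illustrative assumptions, not HiGraph+'s graph-kernel formulation.

```python
# Minimal sketch of label propagation through a cross-frame affinity matrix
# (assumptions: cosine-similarity affinity, temperature 0.07).
import torch
import torch.nn.functional as F

def propagate_labels(feat_ref, feat_tgt, labels_ref, temperature=0.07):
    """feat_*: (N, C) patch features; labels_ref: (N, K) one-hot labels."""
    feat_ref = F.normalize(feat_ref, dim=1)
    feat_tgt = F.normalize(feat_tgt, dim=1)
    affinity = feat_tgt @ feat_ref.t() / temperature   # (N_tgt, N_ref)
    weights = affinity.softmax(dim=1)                  # row-normalized affinities
    return weights @ labels_ref                        # (N_tgt, K) soft labels

if __name__ == "__main__":
    labels = F.one_hot(torch.randint(0, 3, (64,)), 3).float()
    pred = propagate_labels(torch.randn(64, 128), torch.randn(64, 128), labels)
    print(pred.shape)   # torch.Size([64, 3])
```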
3. Lv X, Zhang S, Wang C, Zhang W, Yao H, Huang Q. Unsupervised Low-Light Video Enhancement With Spatial-Temporal Co-Attention Transformer. IEEE Transactions on Image Processing 2023; 32:4701-4715. PMID: 37549080. DOI: 10.1109/tip.2023.3301332.
Abstract
Existing low-light video enhancement methods are dominated by Convolutional Neural Networks (CNNs) trained in a supervised manner. Because collecting paired dynamic low-/normal-light videos in real-world scenes is difficult, these methods are usually trained on synthetic, static, and uniform-motion videos, which undermines their generalization to real-world scenes. Additionally, they typically suffer from temporal inconsistency (e.g., flickering artifacts and motion blur) when handling large-scale motions, since the local perception of CNNs limits their ability to model long-range dependencies in both the spatial and temporal domains. To address these problems, we propose, to the best of our knowledge, the first unsupervised method for low-light video enhancement, named LightenFormer, which models long-range intra- and inter-frame dependencies with a spatial-temporal co-attention transformer to enhance brightness while maintaining temporal consistency. Specifically, an effective yet lightweight S-curve Estimation Network (SCENet) is first proposed to estimate pixel-wise S-shaped non-linear curves (S-curves) that adaptively adjust the dynamic range of the input video. Next, to enforce temporal consistency, we present a Spatial-Temporal Refinement Network (STRNet) to refine the enhanced video. The core module of STRNet is a novel Spatial-Temporal Co-attention Transformer (STCAT), which exploits multi-scale self- and cross-attention interactions to capture long-range correlations in both the spatial and temporal domains across frames for implicit motion estimation. To enable unsupervised training, we further propose two non-reference loss functions based on the invertibility of the S-curve and the noise independence among frames. Extensive experiments on the SDSD and LLIV-Phone datasets demonstrate that LightenFormer outperforms state-of-the-art methods.
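As a rough illustration of the curve-adjustment idea behind SCENet, the sketch below predicts a per-pixel steepness map with a tiny convolutional network and applies a sigmoid-based S-curve rescaled to keep 0 and 1 fixed. The curve form and the parameter network are assumptions, not the paper's SCENet.

```python
# Minimal sketch of pixel-wise S-curve adjustment (assumptions: sigmoid-based
# curve, tiny conv network predicting per-pixel steepness).
import torch
import torch.nn as nn

class SCurveAdjust(nn.Module):
    def __init__(self):
        super().__init__()
        # predicts one steepness map per pixel from the RGB frame
        self.param_net = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):                          # x in [0, 1], (B, 3, H, W)
        k = 1.0 + 9.0 * self.param_net(x)          # per-pixel steepness in (1, 10)
        lo, hi = torch.sigmoid(-k / 2), torch.sigmoid(k / 2)
        # sigmoid-based S-curve, rescaled so 0 -> 0 and 1 -> 1; steepens mid-tones
        return (torch.sigmoid(k * (x - 0.5)) - lo) / (hi - lo)

if __name__ == "__main__":
    frame = torch.rand(1, 3, 64, 64)
    adjusted = SCurveAdjust()(frame)
    print(adjusted.shape, adjusted.min().item(), adjusted.max().item())
```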
4. Yang Y, Gu X. Joint Correlation and Attention Based Feature Fusion Network for Accurate Visual Tracking. IEEE Transactions on Image Processing 2023; 32:1705-1715. PMID: 37028050. DOI: 10.1109/tip.2023.3251027.
Abstract
Correlation operations and attention mechanisms are two popular feature fusion approaches that play an important role in visual object tracking. However, correlation-based tracking networks are sensitive to location information but lose some contextual semantics, while attention-based tracking networks make full use of rich semantic information but ignore the position distribution of the tracked object. In this paper, we therefore propose a novel tracking framework based on joint correlation and attention networks, termed JCAT, which effectively combines the advantages of these two complementary feature fusion approaches. Concretely, JCAT adopts parallel correlation and attention branches to generate position and semantic features, and the fused features are obtained by directly adding the position feature and the semantic feature. Finally, the fused features are fed into a segmentation network to generate a pixel-wise estimate of the object state. Furthermore, we develop a segmentation memory bank and an online sample-filtering mechanism for robust segmentation and tracking. Extensive experimental results on eight challenging visual tracking benchmarks show that the proposed JCAT tracker achieves very promising performance and sets a new state of the art on the VOT2018 benchmark.
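The sketch below mirrors the parallel design described above: a correlation branch producing a position-sensitive map and a cross-attention branch producing semantic features, fused by simple addition. The pooled-template correlation, channel sizes, and single attention layer are assumptions rather than JCAT's exact modules.

```python
# Minimal sketch of parallel correlation + cross-attention branches fused by
# addition (assumptions: pooled-template cosine correlation, one attention layer).
import torch
import torch.nn as nn
import torch.nn.functional as F

class CorrAttnFusion(nn.Module):
    def __init__(self, channels=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.lift = nn.Conv2d(1, channels, kernel_size=1)  # 1-ch corr map -> C channels

    def forward(self, template, search):           # (B, C, h, w), (B, C, H, W)
        B, C, H, W = search.shape
        # correlation branch: similarity of every search location to the pooled template
        t_vec = F.normalize(template.flatten(2).mean(-1), dim=1)            # (B, C)
        corr = (F.normalize(search, dim=1) * t_vec[:, :, None, None]).sum(1, keepdim=True)
        pos_feat = self.lift(corr)                                          # (B, C, H, W)
        # attention branch: search tokens attend to template tokens
        q = search.flatten(2).transpose(1, 2)                               # (B, HW, C)
        kv = template.flatten(2).transpose(1, 2)                            # (B, hw, C)
        sem_feat, _ = self.attn(q, kv, kv)
        sem_feat = sem_feat.transpose(1, 2).reshape(B, C, H, W)
        return pos_feat + sem_feat                                          # additive fusion

if __name__ == "__main__":
    fuse = CorrAttnFusion()
    out = fuse(torch.randn(2, 256, 8, 8), torch.randn(2, 256, 16, 16))
    print(out.shape)   # torch.Size([2, 256, 16, 16])
```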
5. Song X, Zhou D, Li W, Dai Y, Shen Z, Zhang L, Li H. TUSR-Net: Triple Unfolding Single Image Dehazing With Self-Regularization and Dual Feature to Pixel Attention. IEEE Transactions on Image Processing 2023; 32:1231-1244. PMID: 37022903. DOI: 10.1109/tip.2023.3234701.
Abstract
Single image dehazing is a challenging and ill-posed problem due to the severe information degradation of images captured in hazy conditions. Remarkable progress has been achieved by deep-learning-based dehazing methods, where residual learning is commonly used to separate the hazy image into clear and haze components. However, the inherently low similarity between the haze and clear components is commonly neglected, and the lack of a constraint on the dissimilarity between the two components often restricts the performance of these approaches. To deal with these problems, we propose an end-to-end self-regularized network (TUSR-Net) that exploits the dissimilarity between the different components of the hazy image, i.e., self-regularization (SR). Specifically, the hazy image is separated into clear and haze components, and the constraint between them, i.e., self-regularization, is leveraged to pull the recovered clear image closer to the ground truth, which substantially improves dehazing performance. Meanwhile, an effective triple unfolding framework combined with dual feature-to-pixel attention is proposed to intensify and fuse intermediate information at the feature, channel, and pixel levels, respectively, yielding features with better representational ability. With a weight-sharing strategy, TUSR-Net achieves a better trade-off between performance and parameter size and is more flexible. Experiments on various benchmark datasets demonstrate the superiority of TUSR-Net over state-of-the-art single image dehazing methods.
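One way to express the self-regularization constraint described above is a loss that penalizes similarity between the recovered clear component and the residual haze component. The sketch below uses a cosine-similarity penalty as an assumed, illustrative form of such a constraint; it is not the TUSR-Net loss.

```python
# Minimal sketch of a self-regularization term that pushes the clear and haze
# components apart (assumptions: cosine-similarity form, weight 0.1).
import torch
import torch.nn.functional as F

def self_regularization_loss(hazy, clear_pred, weight=0.1):
    haze_residual = hazy - clear_pred                  # residual haze component
    sim = F.cosine_similarity(
        clear_pred.flatten(1), haze_residual.flatten(1), dim=1
    )
    return weight * sim.abs().mean()                   # discourage component similarity

if __name__ == "__main__":
    hazy = torch.rand(2, 3, 32, 32)
    clear_pred = torch.rand(2, 3, 32, 32, requires_grad=True)
    loss = self_regularization_loss(hazy, clear_pred)
    loss.backward()
    print(loss.item())
```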
6. Wang X, Chen Z, Jiang B, Tang J, Luo B, Tao D. Beyond Greedy Search: Tracking by Multi-Agent Reinforcement Learning-Based Beam Search. IEEE Transactions on Image Processing 2022; 31:6239-6254. PMID: 36166563. DOI: 10.1109/tip.2022.3208437.
Abstract
To track the target in a video, current visual trackers usually adopt greedy search for target localization in each frame; that is, the candidate region with the maximum response score is selected as the tracking result of each frame. However, we find that this may not be an optimal choice, especially in challenging tracking scenarios such as heavy occlusion and fast motion. In particular, once a tracker drifts, errors accumulate and further make the response scores estimated by the tracker unreliable in future frames. To address this issue, we propose to maintain multiple tracking trajectories and apply a beam search strategy, so that the trajectory with fewer accumulated errors can be identified. This paper accordingly introduces a novel multi-agent reinforcement-learning-based beam search tracking strategy, termed BeamTracking. It is mainly inspired by the image captioning task, which takes an image as input and generates diverse descriptions using a beam search algorithm. We therefore formulate tracking as a sample-selection problem fulfilled by multiple parallel decision-making processes, each of which picks out one sample as its tracking result in each frame. Each maintained trajectory is associated with an agent that performs the decision-making and determines which actions should be taken to update the related information. More specifically, using a classification-based tracker as the baseline, we first adopt a bi-GRU to encode the target feature, proposal feature, and its response score into a unified state representation. The state feature and the greedy-search result are then fed into the first agent for independent action selection. Afterwards, the output action and state features are fed into subsequent agents for diverse result prediction. When all frames have been processed, we select the trajectory with the maximum accumulated score as the tracking result. Extensive experiments on seven popular tracking benchmarks validate the effectiveness of the proposed algorithm.
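The trajectory-maintenance idea reduces, in its simplest form, to beam search over per-frame candidate scores: keep the top-k partial trajectories by accumulated score instead of greedily committing to one. The sketch below shows that skeleton; the learned agents, bi-GRU state encoding, and reinforcement-learning policies of BeamTracking are omitted, and plain response scores are an assumption for illustration.

```python
# Minimal sketch of beam search over per-frame candidate response scores
# (assumptions: raw scores, no learned agents or state encoding).
import torch

def beam_search_track(frame_scores, beam_width=3):
    """frame_scores: list of (num_candidates,) tensors, one per frame.
    Returns the candidate indices of the highest-scoring trajectory."""
    beams = [([], 0.0)]                                # (candidate indices, accumulated score)
    for scores in frame_scores:
        expanded = [
            (path + [i], acc + scores[i].item())
            for path, acc in beams
            for i in range(scores.numel())
        ]
        # keep only the top-k trajectories by accumulated score
        beams = sorted(expanded, key=lambda b: b[1], reverse=True)[:beam_width]
    best_path, best_score = beams[0]
    return best_path, best_score

if __name__ == "__main__":
    scores = [torch.rand(5) for _ in range(10)]        # 10 frames, 5 candidates each
    path, score = beam_search_track(scores)
    print(path, round(score, 3))
```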
7. Song M, Song W, Yang G, Chen C. Improving RGB-D Salient Object Detection via Modality-Aware Decoder. IEEE Transactions on Image Processing 2022; 31:6124-6138. PMID: 36112559. DOI: 10.1109/tip.2022.3205747.
Abstract
Most existing RGB-D salient object detection (SOD) methods focus primarily on cross-modal and cross-level saliency fusion, which has proved efficient and effective. However, these methods still have a critical limitation: their fusion patterns, typically combinations of selective characteristics and their variations, depend too heavily on the network's non-linear adaptability. In such methods, the balance between RGB and D (depth) is formulated individually over intermediate feature slices, while the relation at the modality level may not be learned properly. The optimal RGB-D combination differs across RGB-D scenarios, and the exact complementary status is frequently determined by several modality-level factors, such as depth quality, the complexity of the RGB scene, and the degree of harmony between them. Given the existing approaches, it may therefore be difficult to achieve further performance breakthroughs, since their methodologies are relatively insensitive to modality. To tackle this problem, this paper presents the Modality-aware Decoder (MaD). Its key technical innovations include a series of feature embedding, modality reasoning, and feature back-projecting and collecting strategies, which together upgrade the widely used multi-scale, multi-level decoding process to be modality-aware. MaD achieves competitive performance over state-of-the-art (SOTA) models without any fancy tricks in the decoder's design. Codes and results are publicly available at https://github.com/MengkeSong/MaD.
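A minimal way to make fusion modality-aware, as opposed to purely feature-slice-wise, is to predict a global weight per modality from pooled features and reweight the two streams before merging. The sketch below shows that idea under assumed layer sizes; it is only loosely inspired by the description above and is not the MaD decoder.

```python
# Minimal sketch of modality-level gating for RGB-D fusion (assumptions:
# global-average-pooled gate, softmax over two modality weights).
import torch
import torch.nn as nn

class ModalityGate(nn.Module):
    def __init__(self, channels=256):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(2 * channels, channels), nn.ReLU(),
            nn.Linear(channels, 2), nn.Softmax(dim=1),
        )
        self.merge = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, rgb_feat, depth_feat):           # both (B, C, H, W)
        pooled = torch.cat(
            [rgb_feat.mean(dim=(2, 3)), depth_feat.mean(dim=(2, 3))], dim=1
        )
        w = self.gate(pooled)                          # (B, 2) modality weights
        rgb_w = rgb_feat * w[:, 0, None, None, None]
        dep_w = depth_feat * w[:, 1, None, None, None]
        return self.merge(torch.cat([rgb_w, dep_w], dim=1))

if __name__ == "__main__":
    fuse = ModalityGate()
    out = fuse(torch.randn(2, 256, 20, 20), torch.randn(2, 256, 20, 20))
    print(out.shape)   # torch.Size([2, 256, 20, 20])
```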
8. RANet: A Reliability-Guided Aggregation Network for Hyperspectral and RGB Fusion Tracking. Remote Sensing 2022. DOI: 10.3390/rs14122765.
Abstract
Object tracking based on RGB images may fail when the color of the tracked object is similar to that of the background. Hyperspectral images, with their rich spectral features, can provide more information for RGB-based trackers; however, no fusion tracking algorithm based on hyperspectral and RGB images has been proposed. In this paper, we propose a reliability-guided aggregation network (RANet) for hyperspectral and RGB tracking, which guides the combination of hyperspectral and RGB information through modality reliability to improve tracking performance. Specifically, a dual branch based on the Transformer Tracking (TransT) structure is constructed to obtain information from the hyperspectral and RGB modalities. A classification response aggregation module is then designed to combine the two modalities by fusing the responses predicted by the classification heads. Finally, the reliability of each modality is taken into account in the aggregation module to guide the aggregation of the modality information. Extensive experimental results on a public dataset composed of hyperspectral and RGB image sequences show that the tracker based on our fusion method outperforms the corresponding single-modality trackers, which fully demonstrates the effectiveness of the fusion method. In particular, the RANet tracker built on the TransT tracker achieves the best accuracy of 0.709, indicating the effectiveness and superiority of RANet.
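The sketch below illustrates reliability-guided aggregation in its simplest form: the two branches' classification response maps are combined with weights derived from a per-branch reliability score. Using the peak response value as the reliability proxy is an assumption for illustration, not RANet's learned reliability estimate.

```python
# Minimal sketch of reliability-weighted aggregation of two classification
# response maps (assumption: peak response as the reliability proxy).
import torch

def aggregate_responses(resp_rgb, resp_hsi):
    """resp_*: (B, 1, H, W) classification response maps from the two branches."""
    rel_rgb = resp_rgb.flatten(1).max(dim=1).values    # peak response as reliability
    rel_hsi = resp_hsi.flatten(1).max(dim=1).values
    weights = torch.softmax(torch.stack([rel_rgb, rel_hsi], dim=1), dim=1)  # (B, 2)
    w_rgb = weights[:, 0].view(-1, 1, 1, 1)
    w_hsi = weights[:, 1].view(-1, 1, 1, 1)
    return w_rgb * resp_rgb + w_hsi * resp_hsi

if __name__ == "__main__":
    fused = aggregate_responses(torch.rand(2, 1, 16, 16), torch.rand(2, 1, 16, 16))
    print(fused.shape)   # torch.Size([2, 1, 16, 16])
```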
9. An Accurate Refinement Pathway for Visual Tracking. Information 2022. DOI: 10.3390/info13030147. Open access.
Abstract
Recently, in the field of visual object tracking, algorithms that combine tracking with video object segmentation have achieved impressive results using masks to label targets in the VOT2020 dataset. Most such trackers obtain the object mask by increasing the resolution through multiple upsampling modules and gradually refining the mask by summation with features from the backbone network. However, this refinement pathway does not fully exploit the spatial information of the backbone features, so the segmentation results are imperfect. In this paper, a cross-stage and cross-resolution (CSCR) module is proposed to improve segmentation. The module makes full use of the semantic information of high-level features and the spatial information of low-level features, fusing them through skip connections to achieve very accurate segmentation. Experiments were conducted on the VOT datasets, and the results outperformed other excellent trackers, verifying the effectiveness of the proposed algorithm.
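The sketch below shows one refinement step of the kind described above: a high-level semantic feature is upsampled and fused with a low-level spatial backbone feature through a skip connection. Channel counts and the convolution layout are assumptions, not the CSCR module itself.

```python
# Minimal sketch of cross-resolution refinement with a skip connection
# (assumptions: 1x1 channel reduction, bilinear upsampling, 3x3 fusion conv).
import torch
import torch.nn as nn
import torch.nn.functional as F

class RefineBlock(nn.Module):
    def __init__(self, high_ch, low_ch, out_ch):
        super().__init__()
        self.reduce_high = nn.Conv2d(high_ch, out_ch, kernel_size=1)
        self.reduce_low = nn.Conv2d(low_ch, out_ch, kernel_size=1)
        self.fuse = nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, high_feat, low_feat):
        # bring the semantic feature to the resolution of the low-level feature
        up = F.interpolate(self.reduce_high(high_feat), size=low_feat.shape[-2:],
                           mode="bilinear", align_corners=False)
        return self.fuse(up + self.reduce_low(low_feat))   # skip-connection fusion

if __name__ == "__main__":
    block = RefineBlock(high_ch=512, low_ch=256, out_ch=128)
    out = block(torch.randn(1, 512, 16, 16), torch.randn(1, 256, 32, 32))
    print(out.shape)   # torch.Size([1, 128, 32, 32])
```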