1. Arthanari S, Elayaperumal D, Joo YH. Learning temporal regularized spatial-aware deep correlation filter tracking via adaptive channel selection. Neural Netw 2025;186:107210. PMID: 39987711. DOI: 10.1016/j.neunet.2025.107210.
Abstract
In recent years, deep correlation filters have demonstrated outstanding performance in robust object tracking. Nevertheless, correlation filters struggle with heavy occlusion, target deviation, and background clutter because they do not make effective use of previous target information. To overcome these issues, we propose a novel temporal regularized spatial-aware deep correlation filter tracker with adaptive channel selection. To this end, we first present an adaptive channel selection approach that efficiently handles target deviation by adaptively selecting suitable channels during the learning stage. This adaptive channel selection also allows dynamic adjustment of the filter based on the unique characteristics of the target object, which enhances the tracker's flexibility and makes it well-suited to diverse tracking scenarios. Second, we propose a spatial-aware correlation filter with dynamic spatial constraints, which effectively reduces the filter response in complex background regions by distinguishing between foreground and background in the response map, so the target can be easily identified within the foreground region. Third, we design a temporal regularization approach that improves target accuracy under large appearance variations; it considers both the current and previous frames of the target region, significantly enhancing tracking by exploiting historical information. Finally, we present a comprehensive experimental analysis on the OTB-2013, OTB-2015, TempleColor-128, UAV-123, UAVDT, and DTB-70 benchmark datasets to demonstrate the effectiveness of the proposed approach against state-of-the-art trackers.
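To make the ingredients concrete, a generic spatio-temporally regularized DCF objective of the kind this abstract describes (the notation below is our own sketch, not the paper's exact formulation) is:

```latex
\min_{f}\;\Big\|\, y-\sum_{k\in\mathcal{K}} x^{k}\star f^{k} \Big\|_{2}^{2}
\;+\;\lambda\sum_{k\in\mathcal{K}}\big\| w\odot f^{k}\big\|_{2}^{2}
\;+\;\mu\,\big\| f-f_{t-1}\big\|_{2}^{2}
```

Here K would be the adaptively selected channel subset, w a dynamic spatial weight map that suppresses responses in background regions, and the last term ties the current filter f to the previous frame's filter f_{t-1}, which is the temporal regularization over historical information.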
Affiliation(s)
- Sathiyamoorthi Arthanari: School of IT Information and Control Engineering, Kunsan National University, 558 Daehak-ro, Gunsan-si, Jeonbuk 54150, Republic of Korea
- Dinesh Elayaperumal: School of IT Information and Control Engineering, Kunsan National University, 558 Daehak-ro, Gunsan-si, Jeonbuk 54150, Republic of Korea
- Young Hoon Joo: School of IT Information and Control Engineering, Kunsan National University, 558 Daehak-ro, Gunsan-si, Jeonbuk 54150, Republic of Korea

2. Mao K, Hong X, Fan X, Zuo W. A Swiss Army Knife for Tracking by Natural Language Specification. IEEE Trans Image Process 2025;34:2254-2268. PMID: 40168206. DOI: 10.1109/tip.2025.3553290.
Abstract
Tracking by natural language specification requires trackers to perform grounding and tracking jointly. Existing methods either use separate models or a single shared network, failing to account jointly for both the links and the diversity between the two tasks. In this paper, we propose a novel framework that performs dynamic task switching to customize its network path routing for each task within a unified model. To this end, we design a task-switchable attention module, which acquires modal relation patterns with a different dominant modality for each task via dynamic task switching. In addition, to alleviate the inconsistency between the static language description and the dynamic target appearance during tracking, we propose a language renovation mechanism that renovates the initial language online via visual-context-aware linguistic prompting. Extensive experimental results on five datasets demonstrate that the proposed method performs favorably against state-of-the-art approaches for both grounding and tracking. Our project will be available at: https://github.com/mkg1204/SAKTrack.
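As a rough illustration of the routing idea (module names and the routing logic below are our assumptions, not the authors' code), a task-switchable module might route inputs through a task-specific attention branch, with a different dominant modality per task, inside one shared model:

```python
import torch
import torch.nn as nn

class TaskSwitchableAttention(nn.Module):
    """Sketch: one shared model, per-task attention branches selected at run time."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # Separate attention parameters per task; the rest of the network is shared.
        self.attn = nn.ModuleDict({
            "grounding": nn.MultiheadAttention(dim, num_heads, batch_first=True),
            "tracking": nn.MultiheadAttention(dim, num_heads, batch_first=True),
        })

    def forward(self, vision, language, task: str):
        # Dominant modality differs per task: grounding lets language query vision,
        # tracking lets vision query the language tokens.
        if task == "grounding":
            out, _ = self.attn[task](language, vision, vision)
        else:
            out, _ = self.attn[task](vision, language, language)
        return out

tokens_v = torch.randn(2, 196, 256)   # visual tokens
tokens_l = torch.randn(2, 20, 256)    # language tokens
module = TaskSwitchableAttention(256)
print(module(tokens_v, tokens_l, task="tracking").shape)  # torch.Size([2, 196, 256])
```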

3. Moorthy S, Joo YH. Learning dynamic spatial-temporal regularized correlation filter tracking with response deviation suppression via multi-feature fusion. Neural Netw 2023;167:360-379. PMID: 37673025. DOI: 10.1016/j.neunet.2023.08.019.
Abstract
Visual object tracking (VOT) for intelligent video surveillance has attracted great attention in the research community, thanks to advances in computer vision and camera technology. Meanwhile, discriminative correlation filter (DCF) trackers have garnered significant interest owing to their high accuracy and low computing cost. Many researchers have introduced spatial and temporal regularization into the DCF framework to achieve a more robust appearance model and further improve tracking performance. However, these algorithms typically set fixed spatial and temporal regularization parameters, which limits flexibility and adaptability in cluttered and challenging scenarios. To overcome these problems, we propose a new dynamic spatial-temporal regularization for the DCF tracking model that encourages the filter to concentrate on more reliable regions during the training stage. Furthermore, we present a response deviation-suppressed regularization term that encourages temporal consistency and avoids model degradation by suppressing relative response changes between two consecutive frames. Moreover, we introduce a multi-memory tracking framework to exploit various features, with each memory contributing to tracking the target across all frames. Extensive experiments on the OTB-2013, OTB-2015, TC-128, UAV-123, UAVDT, and DTB-70 datasets show that the proposed method outperforms many state-of-the-art DCF-based and deep trackers in terms of tracking accuracy and success rate.
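A minimal sketch of the response-deviation idea (the function name and the specific test are our assumptions): measure how much the correlation response map changes between consecutive frames, after aligning the peaks so pure translation does not count, and suppress the model update when the deviation spikes, since that often signals occlusion or drift:

```python
import numpy as np

def relative_response_deviation(resp_prev: np.ndarray, resp_curr: np.ndarray) -> float:
    """Relative change between consecutive correlation response maps,
    with peaks aligned so that target translation does not count as deviation."""
    peak_prev = np.unravel_index(resp_prev.argmax(), resp_prev.shape)
    peak_curr = np.unravel_index(resp_curr.argmax(), resp_curr.shape)
    shift = (peak_prev[0] - peak_curr[0], peak_prev[1] - peak_curr[1])
    aligned = np.roll(resp_curr, shift, axis=(0, 1))
    return float(np.linalg.norm(aligned - resp_prev) /
                 (np.linalg.norm(resp_prev) + 1e-12))

rng = np.random.default_rng(0)
prev = rng.random((50, 50))
curr = prev + 0.01 * rng.random((50, 50))
# Skip (or down-weight) the filter update when the deviation is too large.
update_model = relative_response_deviation(prev, curr) < 0.5
print(update_model)  # True for this lightly perturbed response
```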
Affiliation(s)
- Sathishkumar Moorthy: School of IT Information and Control Engineering, Kunsan National University, 558 Daehak-ro, Gunsan-si, Jeonbuk 54150, Republic of Korea
- Young Hoon Joo: School of IT Information and Control Engineering, Kunsan National University, 558 Daehak-ro, Gunsan-si, Jeonbuk 54150, Republic of Korea

4. Wen J, Chu H, Lai Z, Xu T, Shen L. Enhanced robust spatial feature selection and correlation filter learning for UAV tracking. Neural Netw 2023;161:39-54. PMID: 36735999. DOI: 10.1016/j.neunet.2023.01.003.
Abstract
The spatial boundary effect can significantly reduce the performance of a learned discriminative correlation filter (DCF) model. A common way to relieve this effect is to extract appearance features from a wider region around the target. However, this introduces unwanted features from background pixels and noise, which reduces the filter's discriminative power. To address this shortcoming, this paper proposes enhanced robust spatial feature selection and correlation filter learning (EFSCF), which performs jointly sparse feature learning to handle boundary effects effectively while suppressing the influence of background pixels and noise. Unlike ℓ2-norm-based tracking approaches, which are prone to non-Gaussian noise, the proposed method imposes the ℓ2,1-norm on the loss term to enhance robustness against training outliers. To enhance discrimination further, a jointly sparse feature selection scheme based on the ℓ2,1-norm is designed to regularize the filter in rows and columns simultaneously; to the best of the authors' knowledge, this is the first work to exploit structural sparsity in the rows and columns of a learned filter at the same time. The model can be solved efficiently by the alternating direction method of multipliers. EFSCF is verified by experiments on four challenging unmanned aerial vehicle datasets under severe noise and appearance changes, and the results show that it achieves better tracking performance than state-of-the-art trackers.
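For reference, the ℓ2,1-norm sums the ℓ2-norms of a matrix's rows, which drives entire rows to zero; applying it along both rows and columns of the filter (a minimal numpy sketch, not the authors' solver; the function names are ours) yields the joint structural sparsity the abstract refers to:

```python
import numpy as np

def l21_norm(mat: np.ndarray) -> float:
    """l2,1-norm: sum of the l2-norms of the rows (row-structured sparsity)."""
    return float(np.sqrt((mat ** 2).sum(axis=1)).sum())

def joint_row_col_penalty(filt: np.ndarray, gamma_r: float, gamma_c: float) -> float:
    # Regularizing rows and columns simultaneously encourages the learned
    # filter to keep energy only in a few informative rows AND columns.
    return gamma_r * l21_norm(filt) + gamma_c * l21_norm(filt.T)

f = np.zeros((5, 5)); f[2, :] = 1.0; f[:, 3] = 1.0
print(joint_row_col_penalty(f, 0.1, 0.1))
```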
Affiliation(s)
- Jiajun Wen: College of Computer Science & Software Engineering, Shenzhen University, Shenzhen 518060, China; Guangdong Key Laboratory of Intelligent Information Processing, Shenzhen University, Shenzhen 518060, China; Guangdong Laboratory of Artificial-Intelligence and Cyber-Economics (SZ), Shenzhen University, Shenzhen 518060, China
- Honglin Chu: College of Computer Science & Software Engineering, Shenzhen University, Shenzhen 518060, China
- Zhihui Lai: College of Computer Science & Software Engineering, Shenzhen University, Shenzhen 518060, China; Guangdong Key Laboratory of Intelligent Information Processing, Shenzhen University, Shenzhen 518060, China; Guangdong Laboratory of Artificial-Intelligence and Cyber-Economics (SZ), Shenzhen University, Shenzhen 518060, China; Shenzhen Institute of Artificial Intelligence and Robotics for Society, Shenzhen 518129, China
- Tianyang Xu: School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214122, China
- Linlin Shen: College of Computer Science & Software Engineering, Shenzhen University, Shenzhen 518060, China; Guangdong Key Laboratory of Intelligent Information Processing, Shenzhen University, Shenzhen 518060, China; Guangdong Laboratory of Artificial-Intelligence and Cyber-Economics (SZ), Shenzhen University, Shenzhen 518060, China

5. Elayaperumal D, Joo YH. Learning Spatial Variance-Key Surrounding-Aware Tracking via Multi-Expert Deep Feature Fusion. Inf Sci 2023. DOI: 10.1016/j.ins.2023.02.009.

6. WATB: Wild Animal Tracking Benchmark. Int J Comput Vis 2022. DOI: 10.1007/s11263-022-01732-3.

7. Han R, Feng W, Zhang Y, Zhao J, Wang S. Multiple Human Association and Tracking From Egocentric and Complementary Top Views. IEEE Trans Pattern Anal Mach Intell 2022;44:5225-5242. PMID: 33798068. DOI: 10.1109/tpami.2021.3070562.
Abstract
Crowded scene surveillance can significantly benefit from combining an egocentric-view camera with a complementary top-view camera. A typical setting is an egocentric-view camera, e.g., a wearable camera on the ground capturing rich local details, and a top-view camera, e.g., a drone-mounted one at high altitude providing a global picture of the scene. To collaboratively analyze such complementary-view videos, an important task is to associate and track multiple people across views and over time. This is challenging and differs from classical human tracking, since we need not only to track multiple subjects in each video but also to identify the same subjects across the two complementary views. This paper formulates the task as a constrained mixed integer programming problem, in which a major challenge is how to effectively measure subject similarity over time in each video and across the two views. Although appearance and motion consistency apply well to over-time association, they are not good at connecting two highly different complementary views. To this end, we present a spatial-distribution-based approach to reliable cross-view subject association. We also build a dataset to benchmark this new challenging task. Extensive experiments verify the effectiveness of our method.
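In generic form (a sketch with assumed variables, not the paper's exact program), such a cross-view association can be posed as a binary assignment problem:

```latex
\max_{z}\;\sum_{i,j} c_{ij}\,z_{ij}
\quad\text{s.t.}\quad
\sum_{j} z_{ij}\le 1,\;\;
\sum_{i} z_{ij}\le 1,\;\;
z_{ij}\in\{0,1\}
```

where z_ij = 1 assigns subject i in the egocentric view to subject j in the top view, and the affinity c_ij would combine appearance, motion, and spatial-distribution consistency; over-time association constraints can be added in the same fashion.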

8. Behaviour Detection and Recognition of College Basketball Players Based on Multimodal Sequence Matching and Deep Neural Networks. Comput Intell Neurosci 2022;2022:7599685. PMID: 35655509. PMCID: PMC9155933. DOI: 10.1155/2022/7599685.
Abstract
This study fuses multimodal sequence matching with a deep neural network algorithm for detecting and recognizing the behavior of college basketball players. The basic components of basketball technical-action videos are analyzed by studying the practical application of technical actions in professional games and in teaching videos from self-published authors on short-video platforms, and the characteristics of the dataset are analyzed through the literature on basketball action datasets. On the established basketball technical-action dataset, the SSD object detection algorithm is used to crop human motion regions from the video frames, reducing frame size and generating a cropped-frame dataset; this lowers the training load and improves the efficiency of subsequent action-recognition training. By analyzing the characteristics of basic camera motion, a univariate global motion model is proposed that introduces a quadratic term to accurately express shaking transformations, while horizontal and vertical motion are represented independently to reduce model complexity. Comparative experiments show that the proposed model strikes a good balance between complexity and accuracy of global motion representation, making it suitable for global motion modeling in behavior recognition applications and laying the foundation for global and local motion estimation. On this basis, the visual feature change pattern of the key scene area (the basketball region) is combined with motion-pattern-based group behavior recognition and success-failure classification based on key visual information to achieve basketball semantic event recognition. Experimental results on NCAA data show that fusing global and local motion patterns effectively improves group behavior recognition, and the semantic event recognition algorithm combining motion patterns with key visual information achieves the best performance.
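Read literally, a univariate model with a quadratic shaking term and independent horizontal and vertical components might take the following form (this is our hedged reconstruction, not the paper's stated equations):

```latex
u(x) = a_{0} + a_{1}x + a_{2}x^{2},
\qquad
v(y) = b_{0} + b_{1}y + b_{2}y^{2}
```

where u and v are the horizontal and vertical displacements and each polynomial depends on a single coordinate, which is what keeps the model complexity low.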

9. Bi H, Zhu H, Yang L, Wu R. Multi-Scale Attention and Encoder-Decoder Network for Video Saliency Object Detection. Pattern Recognit Image Anal 2022. DOI: 10.1134/s1054661822020031.

10. Yang C, Zhang X, Song Z. CTT: CNN Meets Transformer for Tracking. Sensors (Basel) 2022;22:3210. PMID: 35590900. PMCID: PMC9105974. DOI: 10.3390/s22093210.
Abstract
Siamese networks are one of the most popular directions in deep-learning-based visual object tracking. In Siamese trackers, the feature pyramid network (FPN) and cross-correlation perform feature fusion and the matching of features extracted from the template and search branches, respectively. However, object tracking should focus on global and contextual dependencies. Hence, we introduce a residual transformer structure containing a self-attention encoder-decoder into our tracker as part of the neck. Under this encoder-decoder structure, the encoder promotes interaction between the low-level features that the CNN extracts from the target and search branches to obtain global attention information, while the decoder replaces cross-correlation to send the global attention information into the head module. We add a spatial and channel attention component in the target branch, which further improves the accuracy and robustness of the proposed model at low cost. Finally, we evaluate our tracker CTT in detail on the GOT-10k, VOT2019, OTB-100, LaSOT, NfS, UAV123 and TrackingNet benchmarks, and the proposed method obtains results competitive with state-of-the-art algorithms.
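As a loose sketch of the general idea (a standard transformer decoder layer standing in for cross-correlation; dimensions and names here are our assumptions, not the CTT code), the search-branch tokens can attend to encoded template-plus-search features instead of being correlated with the template:

```python
import torch
import torch.nn as nn

# Encoder: self-attention promotes interaction among template+search tokens;
# decoder: cross-attention replaces the usual cross-correlation step.
dim, heads = 256, 8
encoder = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
decoder = nn.TransformerDecoderLayer(d_model=dim, nhead=heads, batch_first=True)

template = torch.randn(1, 64, dim)    # tokens from the template branch
search = torch.randn(1, 256, dim)     # tokens from the search branch

memory = encoder(torch.cat([template, search], dim=1))  # global attention info
fused = decoder(tgt=search, memory=memory)              # passed on to the head
print(fused.shape)  # torch.Size([1, 256, 256])
```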
Affiliation(s)
- Chen Yang: Xi’an Institute of Optics and Precision Mechanics of CAS, Xi’an 710000, China; University of Chinese Academy of Sciences, Beijing 100049, China
- Ximing Zhang: Xi’an Institute of Optics and Precision Mechanics of CAS, Xi’an 710000, China
- Zongxi Song: Xi’an Institute of Optics and Precision Mechanics of CAS, Xi’an 710000, China

11. Wang Z, Zhou Z, Lu H, Hu Q, Jiang J. Video Saliency Prediction via Joint Discrimination and Local Consistency. IEEE Trans Cybern 2022;52:1490-1501. PMID: 32452797. DOI: 10.1109/tcyb.2020.2989158.
Abstract
While saliency detection on static images has been widely studied, research on video saliency detection is still at an early stage and requires more effort, given the challenge of bringing both local and global consistency of salient objects into full consideration. In this article, we propose a novel dynamic saliency network based on both local consistency and global discrimination, in which semantic features across video frames are extracted simultaneously and a recurrent feature optimization structure is designed to further enhance performance. To ensure that the generated dynamic salient map is more concentrated, we design a lightweight discriminator with a local consistency loss LC to identify subtle differences between predicted maps and ground truths. As a result, the proposed network can be further stimulated to produce more realistic saliency maps with smoother boundaries and simpler layer transitions. The added LC loss forces the network to pay more attention to the local consistency between continuous saliency maps. Both qualitative and quantitative experiments are carried out on three large datasets, and the results demonstrate that our proposed network not only achieves improved performance but also shows good robustness.

13. Zhang Y, Liu G, Huang H, Xiong R, Zhang H. Dual-stream collaborative tracking algorithm combined with reliable memory based update. Neurocomputing 2022. DOI: 10.1016/j.neucom.2022.01.046.

15. Towards accurate estimation for visual object tracking with multi-hierarchy feature aggregation. Neurocomputing 2021. DOI: 10.1016/j.neucom.2021.04.075.

17. Wu S, Zhang K, Li S, Yan J. Joint feature embedding learning and correlation filters for aircraft tracking with infrared imagery. Neurocomputing 2021. DOI: 10.1016/j.neucom.2021.04.018.

19. Vijayan M, Raguraman P, Mohan R. A Fully Residual Convolutional Neural Network for Background Subtraction. Pattern Recognit Lett 2021. DOI: 10.1016/j.patrec.2021.02.017.

20. Fan N, Li X, Zhou Z, Liu Q, He Z. Learning dual-margin model for visual tracking. Neural Netw 2021;140:344-354. PMID: 33930720. DOI: 10.1016/j.neunet.2021.04.004.
Abstract
Existing trackers usually exploit robust features or online updating mechanisms to deal with target variations, a key challenge in visual tracking. However, features that are robust to variations retain little spatial information, and existing online updating methods are prone to overfitting. In this paper, we propose a dual-margin model for robust and accurate visual tracking. The dual-margin model comprises an intra-object margin between different target appearances and an inter-object margin between the target and the background. The proposed method is able not only to distinguish the target from the background but also to perceive target changes, which tracks appearance variation and facilitates accurate target state estimation. In addition, to exploit rich offline video data and learn general rules of target appearance variation, we train the dual-margin model on a large offline video dataset. We perform tracking under a Siamese framework using the constructed appearance set as templates. The proposed method achieves accurate and robust tracking performance on five public datasets while running in real time. The favorable performance against state-of-the-art methods demonstrates the effectiveness of the proposed algorithm.
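One plausible reading of the two margins (a sketch with assumed notation, not the paper's loss) is a pair of hinge terms:

```latex
\mathcal{L}
=\max\!\big(0,\; m_{\text{intra}}-s(z,t)\big)
+\max\!\big(0,\; s(z,b)-s(z,t)+m_{\text{inter}}\big)
```

where s is the learned similarity, z the search feature, and t and b target-appearance and background templates: the first term keeps different appearances of the same target within the intra-object margin, while the second pushes background similarity at least m_inter below target similarity.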
Affiliation(s)
- Nana Fan: School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China
- Xin Li: School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China
- Zikun Zhou: School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China
- Qiao Liu: School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China
- Zhenyu He: School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China; Shenzhen Institute of Artificial Intelligence and Robotics for Society, Shenzhen, China

21. MS-Faster R-CNN: Multi-Stream Backbone for Improved Faster R-CNN Object Detection and Aerial Tracking from UAV Images. Remote Sens 2021. DOI: 10.3390/rs13091670.
Abstract
Tracking objects across multiple video frames is a challenging task due to several difficult issues such as occlusions, background clutter, lighting, and object and camera view-point variations, all of which directly affect object detection. These aspects are even more pronounced when analyzing unmanned aerial vehicle (UAV) images, where vehicle movement can also degrade image quality. A common strategy to address these issues is to analyze the input images at different scales to obtain as much information as possible to correctly detect and track the objects across video sequences. Following this rationale, in this paper we introduce a simple yet effective novel multi-stream (MS) architecture, where a different kernel size is applied to each stream to simulate multi-scale image analysis. The proposed architecture is then used as the backbone of the well-known Faster R-CNN pipeline, defining an MS-Faster R-CNN object detector that consistently detects objects in video sequences. This detector is then used jointly with the Simple Online and Real-time Tracking with a Deep Association Metric (Deep SORT) algorithm to achieve real-time tracking on UAV images. To assess the presented architecture, extensive experiments were performed on the UMCD, UAVDT, UAV20L, and UAV123 datasets. The presented pipeline achieved state-of-the-art performance, confirming that the proposed multi-stream method can correctly emulate the robust multi-scale image analysis paradigm.
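A minimal sketch of the multi-stream idea (channel counts, kernel sizes, and the 1x1 fusion below are our assumptions, not the paper's exact backbone): parallel convolution streams with different kernel sizes emulate multi-scale analysis and are fused before the detector head:

```python
import torch
import torch.nn as nn

class MultiStreamStem(nn.Module):
    """Sketch: parallel conv streams with different kernel sizes, fused by 1x1 conv."""
    def __init__(self, in_ch: int = 3, out_ch: int = 64):
        super().__init__()
        self.streams = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, kernel_size=k, padding=k // 2)
            for k in (3, 5, 7)  # one kernel size per stream
        ])
        self.fuse = nn.Conv2d(3 * out_ch, out_ch, kernel_size=1)

    def forward(self, x):
        # Concatenate the per-scale responses, then mix them channel-wise.
        return self.fuse(torch.cat([s(x) for s in self.streams], dim=1))

x = torch.randn(1, 3, 224, 224)
print(MultiStreamStem()(x).shape)  # torch.Size([1, 64, 224, 224])
```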

22. Xu M, Fu P, Liu B, Li J. Multi-Stream Attention-Aware Graph Convolution Network for Video Salient Object Detection. IEEE Trans Image Process 2021;30:4183-4197. PMID: 33822725. DOI: 10.1109/tip.2021.3070200.
Abstract
Recent advances in deep convolutional neural networks (CNNs) have boosted the development of video salient object detection (SOD), and many remarkable deep-CNN video SOD models have been proposed. However, many existing models still suffer from coarse salient-object boundaries, which may be attributed to the loss of high-frequency information. Traditional graph-based video SOD models preserve object boundaries well by first conducting superpixel/supervoxel segmentation, but they are weaker than the latest deep-CNN models at highlighting the whole object, limited by heuristic graph clustering algorithms. To tackle this problem, we address the issue under the framework of graph convolution networks (GCNs), taking advantage of both graph models and deep neural networks. Specifically, a superpixel-level spatiotemporal graph is first constructed among multiple frame-pairs by exploiting the motion cues implied in the frame-pairs. The graph data is then fed into the devised multi-stream attention-aware GCN, in which a novel edge-gated graph convolution (GC) operation is proposed to boost saliency-information aggregation on the graph. A novel attention module is designed to encode spatiotemporal semantic information via adaptive selection of graph nodes and fusion of the static-specific and motion-specific graph embeddings. Finally, a smoothness-aware regularization term is proposed to enhance the uniformity of the salient object, so that graph nodes (superpixels) inherently belonging to the same class are ideally clustered together in the learned embedding space. Extensive experiments have been conducted on three widely used datasets. Compared with fourteen state-of-the-art video SOD models, our proposed method retains salient-object boundaries well and possesses strong learning ability, showing that this work is a good practice in designing GCNs for video SOD.
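A rough sketch of an edge-gated graph convolution of the kind named here (the gating form is our assumption, not the paper's exact operator): learned per-edge gates modulate how much each neighbor contributes during aggregation over the superpixel graph:

```python
import torch
import torch.nn as nn

class EdgeGatedGC(nn.Module):
    """Sketch: graph convolution whose neighbor aggregation is modulated
    by edge gates computed from node-pair features."""
    def __init__(self, dim: int):
        super().__init__()
        self.lin = nn.Linear(dim, dim)
        self.gate = nn.Linear(2 * dim, 1)  # scalar gate per edge

    def forward(self, h: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        n = h.size(0)
        pairs = torch.cat([h.unsqueeze(1).expand(n, n, -1),
                           h.unsqueeze(0).expand(n, n, -1)], dim=-1)
        gates = torch.sigmoid(self.gate(pairs)).squeeze(-1) * adj  # gate real edges only
        gates = gates / gates.sum(dim=1, keepdim=True).clamp_min(1e-6)
        return torch.relu(self.lin(gates @ h))  # gated neighbor aggregation

h = torch.randn(6, 32)                    # 6 superpixel nodes
adj = (torch.rand(6, 6) > 0.5).float()    # toy adjacency
print(EdgeGatedGC(32)(h, adj).shape)      # torch.Size([6, 32])
```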

23. Guo Q, Feng W, Gao R, Liu Y, Wang S. Exploring the Effects of Blur and Deblurring to Visual Object Tracking. IEEE Trans Image Process 2021;30:1812-1824. PMID: 33417542. DOI: 10.1109/tip.2020.3045630.
Abstract
The existence of motion blur can inevitably influence the performance of visual object tracking. However, in contrast to the rapid development of visual trackers, the quantitative effects of increasing levels of motion blur on tracker performance remain unstudied. Meanwhile, although image deblurring can produce visually sharp videos for pleasant visual perception, it is also unknown whether visual object tracking can benefit from it. In this paper, we present a Blurred Video Tracking (BVT) benchmark to address these two problems; it contains a large variety of videos with different levels of motion blur, together with ground-truth tracking results. To explore the effects of blur and deblurring on visual object tracking, we extensively evaluate 25 trackers on the proposed BVT benchmark and obtain several new and interesting findings. Specifically, we find that light motion blur may improve the accuracy of many trackers, but heavy blur usually hurts tracking performance. We also observe that image deblurring helps tracking accuracy on heavily blurred videos but hurts performance on lightly blurred ones. Based on these observations, we propose a new general GAN-based scheme to improve a tracker's robustness to motion blur, in which a fine-tuned discriminator serves as an adaptive blur assessor to enable selective frame deblurring during tracking. We use this scheme to successfully improve the accuracy of 6 state-of-the-art trackers on motion-blurred videos.
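A minimal sketch of the selective-deblurring loop (the scorer and deblur functions below are hypothetical placeholders, not the paper's GAN components): a blur-assessment score gates whether a frame is deblurred before being handed to the tracker:

```python
import numpy as np

def blur_score(frame: np.ndarray) -> float:
    """Hypothetical stand-in for the fine-tuned discriminator: variance of a
    simple Laplacian response, which drops as motion blur increases."""
    lap = (np.roll(frame, 1, 0) + np.roll(frame, -1, 0) +
           np.roll(frame, 1, 1) + np.roll(frame, -1, 1) - 4 * frame)
    return float(lap.var())

def maybe_deblur(frame: np.ndarray, deblur, threshold: float) -> np.ndarray:
    # Deblur only heavily blurred frames: deblurring lightly blurred ones
    # was observed to hurt tracking accuracy.
    return deblur(frame) if blur_score(frame) < threshold else frame

frame = np.random.rand(120, 160)
identity_deblur = lambda f: f  # plug a real deblurring network in here
out = maybe_deblur(frame, identity_deblur, threshold=0.05)
```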

24. Li Z, Lang C, Liew JH, Li Y, Hou Q, Feng J. Cross-Layer Feature Pyramid Network for Salient Object Detection. IEEE Trans Image Process 2021;30:4587-4598. PMID: 33872147. DOI: 10.1109/tip.2021.3072811.
Abstract
Feature pyramid network (FPN) based models, which fuse semantics and salient details progressively, have proven highly effective in salient object detection. However, these models often generate saliency maps with incomplete object structures or unclear object boundaries, because indirect information propagation among distant layers makes such a fusion structure less effective. In this work, we propose a novel Cross-layer Feature Pyramid Network (CFPN), in which direct cross-layer communication improves the progressive fusion. Specifically, the proposed network first aggregates multi-scale features from different layers into feature maps that have access to both high- and low-level information. It then distributes the aggregated features to all the involved layers to give each access to richer context. In this way, the distributed features of every layer own both semantics and salient details from all other layers simultaneously and suffer less loss of important information during progressive fusion. Finally, CFPN fuses the distributed features of each layer stage by stage. The high-level features that contain context useful for locating complete objects are thus preserved until the final output layer, and the low-level features that contain spatial structure details are embedded into each layer. Extensive experiments over six widely used salient object detection benchmarks and three popular backbones clearly demonstrate that CFPN can accurately locate fairly complete salient regions and effectively segment object boundaries.
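The aggregate-then-distribute flow can be sketched minimally as follows (the shared channel count, reference scale, and additive fusion are our assumptions, not CFPN's exact design):

```python
import torch
import torch.nn.functional as F

def aggregate_then_distribute(feats):
    """Sketch of cross-layer fusion: resize every pyramid level to a common
    scale, aggregate, then hand the aggregate back to each level so every
    layer sees both semantics and spatial detail."""
    target = feats[len(feats) // 2].shape[-2:]          # common reference scale
    resized = [F.interpolate(f, size=target, mode="bilinear", align_corners=False)
               for f in feats]
    agg = torch.stack(resized).mean(dim=0)              # aggregated features
    return [f + F.interpolate(agg, size=f.shape[-2:], mode="bilinear",
                              align_corners=False) for f in feats]

feats = [torch.randn(1, 16, s, s) for s in (64, 32, 16)]
print([f.shape[-1] for f in aggregate_then_distribute(feats)])  # [64, 32, 16]
```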

25. Yuan D, Chang X, Huang PY, Liu Q, He Z. Self-Supervised Deep Correlation Tracking. IEEE Trans Image Process 2020;30:976-985. PMID: 33259298. DOI: 10.1109/tip.2020.3037518.
Abstract
Training a feature extraction network typically requires abundant manually annotated training samples, making this a time-consuming and costly process. Accordingly, we propose an effective self-supervised learning-based tracker in a deep correlation framework (named self-SDCT). Motivated by the forward-backward tracking consistency of a robust tracker, we propose a multi-cycle consistency loss as self-supervised information for learning the feature extraction network from adjacent video frames. At the training stage, we generate pseudo-labels of consecutive video frames by forward-backward prediction under a Siamese correlation tracking framework and use the proposed multi-cycle consistency loss to learn the feature extraction network. Furthermore, we propose a similarity dropout strategy that drops low-quality training sample pairs, and we adopt a cycle trajectory consistency loss in each sample pair to improve the training loss function. At the tracking stage, we employ the pre-trained feature extraction network to extract features and use a Siamese correlation tracking framework to locate the target via forward tracking alone. Extensive experimental results indicate that the proposed self-supervised deep correlation tracker (self-SDCT) achieves tracking performance competitive with state-of-the-art supervised and unsupervised methods on standard evaluation benchmarks.
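One plausible form of the multi-cycle consistency idea (assumed notation, not the paper's exact loss): track a pseudo-labeled position forward for T frames and back again, and penalize the round-trip error over several cycle lengths:

```latex
\mathcal{L}_{\text{cyc}}
=\sum_{T\in\mathcal{T}}
\big\|\,\mathcal{B}_{T}\!\big(\mathcal{F}_{T}(p_{0})\big)-p_{0}\big\|_{2}^{2}
```

where F_T and B_T denote forward and backward Siamese correlation tracking over T frames and p_0 is the starting target position; summing over multiple cycle lengths T supplies the self-supervision signal without manual labels.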

26.
Abstract
Siamese network-based trackers are broadly applied to visual tracking problems due to their balanced performance in terms of speed and accuracy. Tracking desired objects in challenging scenarios is still one of the fundamental concerns in visual tracking. This paper proposes a feature-refined end-to-end tracking framework with real-time tracking speed and considerable performance. A feature refine network is incorporated to enhance the representational power of the target features using high-level semantic information. It allows the network to capture salient information for locating the target and learns a more generalized target representation, advancing overall tracking performance, particularly on challenging sequences. However, the feature refine module alone cannot handle such challenges because of its limited discriminative ability. To overcome this difficulty, we employ an attention module inside the feature refine network, which strengthens the tracker's ability to discriminate between target and background. Furthermore, we conduct extensive experiments on several popular tracking benchmarks, demonstrating that our proposed model achieves state-of-the-art performance over other trackers.

28. Guo Q, Han R, Feng W, Chen Z, Wan L. Selective Spatial Regularization by Reinforcement Learned Decision Making for Object Tracking. IEEE Trans Image Process 2019;29:2999-3013. PMID: 31796406. DOI: 10.1109/tip.2019.2955292.
Abstract
Spatial regularization (SR) is known as an effective tool to alleviate the boundary effect of the correlation filter (CF), a successful visual object tracking scheme from which a number of state-of-the-art trackers stem. Nevertheless, SR greatly increases the optimization complexity of CF, and its target-driven nature means spatially-regularized CF trackers may easily lose occluded targets or targets surrounded by other similar objects. In this paper, we propose selective spatial regularization (SSR) for the CF tracking scheme. It achieves not only higher accuracy and robustness but also higher speed than spatially-regularized CF trackers. Specifically, rather than relying solely on foreground information, we extend the objective function of the CF tracking scheme to learn target-context-regularized filters using target-context-driven weight maps. We then formulate the online selection of these weight maps as a decision-making problem via a Markov Decision Process (MDP), where learning the weight-map selection is equivalent to policy learning for the MDP, solved by a reinforcement learning strategy. Moreover, by adding a special state to the MDP that represents not updating the filters, we can learn when to skip unnecessary or erroneous filter updates, thus accelerating online tracking. Finally, the proposed SSR is used to equip three popular spatially-regularized CF trackers, significantly boosting their tracking accuracy while achieving much faster online tracking speed. Extensive experiments on five benchmarks validate the effectiveness of SSR.
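A toy sketch of the decision-making step (an epsilon-greedy policy over learned action values; the real method trains an MDP policy with reinforcement learning, and every name below is our assumption): each action picks one target-context weight map, or skips the filter update entirely:

```python
import random

ACTIONS = ["target_map", "context_map_1", "context_map_2", "skip_update"]

def select_action(q_values: dict, epsilon: float = 0.1) -> str:
    """Epsilon-greedy policy over weight-map choices; 'skip_update' models
    the special not-updating state described in the abstract."""
    if random.random() < epsilon:
        return random.choice(ACTIONS)   # explore
    return max(ACTIONS, key=lambda a: q_values[a])  # exploit learned values

q = {"target_map": 0.42, "context_map_1": 0.57, "context_map_2": 0.31,
     "skip_update": 0.12}
print(select_action(q))  # usually 'context_map_1'
```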

29.
Abstract
Object tracking has always been an interesting and essential research topic in computer vision, and the model update mechanism is an essential part of it, so its robustness has become a crucial factor in the tracking quality of a sequence. This review analyzes recent tracking-model update strategies. The occasion for updating the target model is discussed first, followed by a detailed discussion of target-model update strategies within the mainstream tracking frameworks, and then background update frameworks. The experimental performance of recently published trackers on specific sequences is listed, their strengths and failure cases are discussed, and conclusions are drawn from those results. A crucial point is that the design of a proper background model, as well as its update strategy, ought to be taken into consideration. A cascade update of the template corresponding to each deep network layer, based on each layer's contribution to target recognition, can also help with more accurate target localization, and target saliency information can be utilized as a tool for state estimation.

30. Xiao Y, Li J, Du B, Wu J, Li X, Chang J, Zhou Y. Robust correlation filter tracking with multi-scale spatial view. Neurocomputing 2019. DOI: 10.1016/j.neucom.2019.05.017.