1
Arthanari S, Elayaperumal D, Joo YH. Learning temporal regularized spatial-aware deep correlation filter tracking via adaptive channel selection. Neural Netw 2025; 186:107210. PMID: 39987711. DOI: 10.1016/j.neunet.2025.107210.
Abstract
In recent years, deep correlation filters have demonstrated outstanding performance in robust object tracking. Nevertheless, correlation filters struggle with heavy occlusion, target deviation, and background clutter because they do not make effective use of previous target information. To overcome these issues, we propose a novel temporal regularized spatial-aware deep correlation filter tracker with adaptive channel selection. To this end, we first present an adaptive channel selection approach, which efficiently handles target deviation by adaptively selecting suitable channels during the learning stage. The adaptive channel selection method also allows the filter to be adjusted dynamically based on the unique characteristics of the target object; this adaptability enhances the tracker's flexibility and makes it well-suited to diverse tracking scenarios. Second, we propose a spatial-aware correlation filter with dynamic spatial constraints, which effectively suppresses the filter response in complex background regions by distinguishing between the foreground and background regions of the response map, so the target can be easily identified within the foreground region. Third, we design a temporal regularization approach that improves tracking accuracy in the case of large appearance variations. This temporal regularization considers both the current and previous frames of the target region, which significantly enhances tracking by exploiting historical information. Finally, we present a comprehensive experimental analysis on the OTB-2013, OTB-2015, TempleColor-128, UAV-123, UAVDT, and DTB-70 benchmark datasets to demonstrate the effectiveness of the proposed approach against state-of-the-art trackers.
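For context, trackers in this family learn the filter by ridge regression with spatial and temporal penalties. A generic objective of this kind (in the spirit of STRCF-style trackers; the notation below is an illustrative sketch, not the paper's exact formulation) is:

```latex
% Generic spatial-temporal regularized DCF objective; an illustrative
% sketch of the model family, not the exact formulation of this paper.
\[
\min_{f}\;
\frac{1}{2}\Bigl\lVert \sum_{d=1}^{D} x^{d} \ast f^{d} - y \Bigr\rVert_{2}^{2}
+ \frac{\lambda}{2} \sum_{d=1}^{D} \bigl\lVert w \odot f^{d} \bigr\rVert_{2}^{2}
+ \frac{\mu}{2} \bigl\lVert f - f_{t-1} \bigr\rVert_{2}^{2}
\]
% x^d: d-th feature channel; f^d: per-channel filter; y: Gaussian label;
% w: spatial penalty weights; f_{t-1}: filter from the previous frame.
% Adaptive channel selection would restrict the sums to the chosen channels d.
```

The spatially weighted term corresponds to the spatial awareness and the last term to the temporal regularization described above.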
Affiliation(s)
- Sathiyamoorthi Arthanari
- School of IT Information and Control Engineering, Kunsan National University, 558 Daehak-ro, Gunsan-si, Jeonbuk 54150, Republic of Korea
- Dinesh Elayaperumal
- School of IT Information and Control Engineering, Kunsan National University, 558 Daehak-ro, Gunsan-si, Jeonbuk 54150, Republic of Korea
- Young Hoon Joo
- School of IT Information and Control Engineering, Kunsan National University, 558 Daehak-ro, Gunsan-si, Jeonbuk 54150, Republic of Korea.
2
Ma S, Wan Z, Zhang L, Hu B, Zhang J, Zhao X. HFFTrack: Transformer tracking via hybrid frequency features. Neural Netw 2025; 186:107269. PMID: 39999533. DOI: 10.1016/j.neunet.2025.107269.
Abstract
Numerous Transformer-based trackers have emerged thanks to the Transformer's powerful global modeling capability. Nevertheless, the Transformer acts as a low-pass filter and has limited capacity to extract the high-frequency features of the target, yet these features are essential for target localization in tracking tasks. To address this issue, this paper proposes a tracking algorithm built on hybrid frequency features, exploring how tracker performance can be improved by fusing multi-frequency features of the target. Specifically, a novel feature extraction network is designed that uses a CNN and a Transformer to learn the target's multi-frequency features in stages, exploiting the advantages of both structures and balancing high- and low-frequency information. Second, a dual-branch encoder is designed that allows the tracker to capture global information in one branch while learning local target features in the other. Finally, a multi-frequency feature fusion network is designed that uses the wavelet transform and convolution to fuse high-frequency and low-frequency features. Extensive experimental results demonstrate that our tracker achieves superior tracking performance on six challenging benchmark datasets (i.e., LaSOT, TrackingNet, GOT-10k, TNL2K, UAV123, and OTB100).
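To make the frequency decomposition concrete, the sketch below splits a feature map into low- and high-frequency subbands with a 2-D wavelet transform and recombines them, roughly as a wavelet-based fusion stage would; the function names, the Haar wavelet, and the reweighting factor are illustrative assumptions, not HFFTrack's actual code.

```python
# A minimal sketch of wavelet-based frequency splitting and fusion using
# PyWavelets; illustrative only, not HFFTrack's released implementation.
import numpy as np
import pywt

def split_frequencies(feature_map: np.ndarray, wavelet: str = "haar"):
    """Decompose a 2-D feature map into a low-frequency approximation
    and three high-frequency detail subbands."""
    low, (horiz, vert, diag) = pywt.dwt2(feature_map, wavelet)
    return low, (horiz, vert, diag)

def fuse_frequencies(low, highs, wavelet: str = "haar", alpha: float = 0.5):
    """Recombine subbands after (here: trivially) reweighting the
    high-frequency bands; a learned fusion would replace alpha."""
    horiz, vert, diag = (alpha * h for h in highs)
    return pywt.idwt2((low, (horiz, vert, diag)), wavelet)

feat = np.random.rand(32, 32).astype(np.float32)
low, highs = split_frequencies(feat)
fused = fuse_frequencies(low, highs)
print(fused.shape)  # (32, 32)
```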
Affiliation(s)
- Sugang Ma
- School of Computer Science and Technology, Xi'an University of Posts and Telecommunications, Xi'an 710121, China; School of Information Engineering, Chang'an University, Xi'an 710064, China.
- Zhen Wan
- School of Computer Science and Technology, Xi'an University of Posts and Telecommunications, Xi'an 710121, China.
- Licheng Zhang
- School of Information Engineering, Chang'an University, Xi'an 710064, China; Shaanxi Engineering Research Center of Internet of Vehicles and Intelligent Vehicle Testing Technique, Xi'an 710064, China.
- Bin Hu
- Department of Computer Science and Technology, Kean University, Union, NJ 07083, United States of America.
- Jinyu Zhang
- School of Electronic Engineering, Xi'an University of Posts and Telecommunications, Xi'an 710121, China.
- Xiangmo Zhao
- School of Information Engineering, Chang'an University, Xi'an 710064, China.
3
Shao P, Xu T, Tang Z, Li L, Wu XJ, Kittler J. TENet: Targetness entanglement incorporating with multi-scale pooling and mutually-guided fusion for RGB-E object tracking. Neural Netw 2025; 183:106948. PMID: 39657526. DOI: 10.1016/j.neunet.2024.106948.
Abstract
There is currently strong interest in improving visual object tracking by augmenting the RGB modality with the output of a visual event camera, which is particularly informative about scene motion. However, existing approaches perform event feature extraction for RGB-E tracking using traditional appearance models that were optimised for RGB-only tracking, without adapting them to the intrinsic characteristics of the event data. To address this problem, we propose an Event backbone (Pooler), designed to obtain a high-quality feature representation that is cognisant of the innate characteristics of the event data, namely its sparsity. In particular, Multi-Scale Pooling is introduced to capture all the motion feature trends within the event data through the utilisation of diverse pooling kernel sizes. The association between the derived RGB and event representations is established by an innovative module performing adaptive Mutually Guided Fusion (MGF). Extensive experimental results show that our method significantly outperforms state-of-the-art trackers on two widely used RGB-E tracking datasets, VisEvent and COESOT, where the precision and success rates on COESOT are improved by 4.9% and 5.2%, respectively. Our code will be available at https://github.com/SSSpc333/TENet.
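As an illustration of the multi-scale pooling idea, the PyTorch sketch below pools event features with several kernel sizes and fuses the results; the kernel sizes, channel count, and module name are assumptions for illustration, not TENet's released implementation.

```python
# A minimal PyTorch sketch of multi-scale pooling over event features;
# hyperparameters are illustrative assumptions, not TENet's actual code.
import torch
import torch.nn as nn

class MultiScalePooling(nn.Module):
    def __init__(self, channels: int, kernel_sizes=(3, 5, 7)):
        super().__init__()
        # One average pool per scale; padding keeps the spatial size unchanged.
        self.pools = nn.ModuleList(
            nn.AvgPool2d(k, stride=1, padding=k // 2) for k in kernel_sizes
        )
        # A 1x1 conv fuses the concatenated multi-scale responses.
        self.fuse = nn.Conv2d(channels * (len(kernel_sizes) + 1), channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        scales = [x] + [pool(x) for pool in self.pools]
        return self.fuse(torch.cat(scales, dim=1))

feats = torch.randn(2, 64, 32, 32)         # a batch of event feature maps
print(MultiScalePooling(64)(feats).shape)  # torch.Size([2, 64, 32, 32])
```

Averaging over sparse event activations at several receptive-field sizes is one simple way to summarise motion trends at different scales before fusion.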
Affiliation(s)
- Pengcheng Shao
- Josef Kittler Research Institute on Artificial Intelligence, China; Sino-UK Joint Laboratory on Artificial Intelligence, Ministry of Science and Technology, China; International Joint Laboratory on Artificial Intelligence, Ministry of Education, China; International Joint Laboratory on Artificial Intelligence, Jiangsu Province, China; School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214122, China
- Tianyang Xu
- Josef Kittler Research Institute on Artificial Intelligence, China; Sino-UK Joint Laboratory on Artificial Intelligence, Ministry of Science and Technology, China; International Joint Laboratory on Artificial Intelligence, Ministry of Education, China; International Joint Laboratory on Artificial Intelligence, Jiangsu Province, China; School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214122, China
- Zhangyong Tang
- Josef Kittler Research Institute on Artificial Intelligence, China; Sino-UK Joint Laboratory on Artificial Intelligence, Ministry of Science and Technology, China; International Joint Laboratory on Artificial Intelligence, Ministry of Education, China; International Joint Laboratory on Artificial Intelligence, Jiangsu Province, China; School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214122, China
- Linze Li
- Josef Kittler Research Institute on Artificial Intelligence, China; Sino-UK Joint Laboratory on Artificial Intelligence, Ministry of Science and Technology, China; International Joint Laboratory on Artificial Intelligence, Ministry of Education, China; International Joint Laboratory on Artificial Intelligence, Jiangsu Province, China; School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214122, China
- Xiao-Jun Wu
- Josef Kittler Research Institute on Artificial Intelligence, China; Sino-UK Joint Laboratory on Artificial Intelligence, Ministry of Science and Technology, China; International Joint Laboratory on Artificial Intelligence, Ministry of Education, China; International Joint Laboratory on Artificial Intelligence, Jiangsu Province, China; School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214122, China.
- Josef Kittler
- Josef Kittler Research Institute on Artificial Intelligence, China; Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, Guildford GU2 7XH, UK; Sino-UK Joint Laboratory on Artificial Intelligence, Ministry of Science and Technology, China; School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214122, China
4
Zhang Y, Pan H, Wang J. Enabling deformation slack in tracking with temporally even correlation filters. Neural Netw 2025; 181:106839. PMID: 39509809. DOI: 10.1016/j.neunet.2024.106839.
Abstract
Discriminative correlation filters with temporal regularization have recently attracted much attention in mobile video tracking, owing to the challenges of target occlusion and background interference. However, rigidly penalizing the variability of templates between adjacent frames makes trackers sluggish in responding to target evolution, leading to inaccurate responses or even tracking failure when deformation occurs. In this paper, we address the problem of instant template learning when the target undergoes drastic variations in appearance and aspect ratio. We first propose a temporally even model featuring deformation slack, which theoretically supports the template's ability to respond quickly to variations while suppressing disturbances. We then formulate an optimal derivation of our model and deduce closed-form solutions to facilitate the algorithm's implementation. Further, we introduce a cyclic shift methodology for mirror factors to achieve scale estimation under varying aspect ratios, thereby dramatically improving cross-area accuracy. Comprehensive comparisons on seven datasets (DroneTB-70, VisDrone-SOT2019, VOT-2019, LaSOT, TC-128, UAV-20L, and UAVDT) demonstrate our excellent performance. Our approach runs at 16.9 frames per second on a low-cost Central Processing Unit, which makes it suitable for tracking on drones. The code and raw results will be made publicly available at: https://github.com/visualperceptlab/TEDS.
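To convey the intuition behind deformation slack (not the paper's exact model), a rigid temporal penalty can be relaxed with a slack variable that absorbs legitimate template change; all symbols below are illustrative:

```latex
% The rigid temporal penalty mu/2 * ||f - f_{t-1}||^2 relaxed with a slack
% variable s; an illustrative sketch of the idea, not the exact TEDS model.
\[
\min_{f,\,s}\;
\frac{1}{2}\lVert X f - y \rVert_{2}^{2}
+ \frac{\lambda}{2}\lVert f \rVert_{2}^{2}
+ \frac{\mu}{2}\bigl\lVert f - f_{t-1} - s \bigr\rVert_{2}^{2}
+ \gamma \lVert s \rVert_{1}
\]
% f: current template; f_{t-1}: previous template; s: sparse slack that
% absorbs deformation so f is not rigidly tied to f_{t-1}; gamma controls
% how much slack is tolerated before it is penalized.
```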
Affiliation(s)
- Yuanming Zhang
- Research Institute of Intelligent Control and Systems, Harbin Institute of Technology, Harbin, 150001, China
- Huihui Pan
- Research Institute of Intelligent Control and Systems, Harbin Institute of Technology, Harbin, 150001, China.
- Jue Wang
- Ningbo Institute of Intelligent Equipment Technology Company Ltd., Ningbo, 315200, China; Department of Automation, University of Science and Technology of China, Hefei, 230027, China
5
Yang X, Che H, Leung MF, Wen S. Self-paced regularized adaptive multi-view unsupervised feature selection. Neural Netw 2024; 175:106295. PMID: 38614023. DOI: 10.1016/j.neunet.2024.106295.
Abstract
Multi-view unsupervised feature selection (MUFS) is an efficient approach for dimensionality reduction of heterogeneous data. However, existing MUFS approaches mostly assign all samples the same weight, so the diversity of samples is not exploited efficiently. Additionally, due to the presence of various regularizations, the resulting MUFS problems are often non-convex, making it difficult to find optimal solutions. To address these issues, a novel MUFS method named Self-paced Regularized Adaptive Multi-view Unsupervised Feature Selection (SPAMUFS) is proposed. Specifically, the proposed approach first trains the MUFS model with simple samples and gradually admits complex samples by using a self-paced regularizer. The l2,p-norm (0 < p ≤ 1) is employed as a sparse regularizer for feature selection.
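For readers unfamiliar with self-paced learning, the standard objective with the hard regularizer is sketched below (a textbook formulation, not necessarily SPAMUFS's exact one): samples whose loss falls below the age parameter k are admitted, and raising k gradually lets harder samples in.

```latex
% Standard self-paced learning objective with the hard regularizer;
% a textbook sketch, not necessarily SPAMUFS's exact formulation.
\[
\min_{W,\; v \in [0,1]^{n}} \;\;
\sum_{i=1}^{n} v_{i}\,\ell_{i}(W) \;-\; k \sum_{i=1}^{n} v_{i},
\qquad
v_{i}^{*} =
\begin{cases}
1, & \ell_{i}(W) < k \quad \text{(easy sample, kept)}\\
0, & \ell_{i}(W) \ge k \quad \text{(hard sample, deferred)}
\end{cases}
\]
% ell_i: loss of sample i under model W; v_i: self-paced sample weight;
% k: age parameter, increased over training to admit harder samples.
```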
Affiliation(s)
- Xuanhao Yang
- College of Electronic and Information Engineering, Southwest University, Chongqing, 400715, China.
- Hangjun Che
- College of Electronic and Information Engineering, Southwest University, Chongqing, 400715, China; Chongqing Key Laboratory of Nonlinear Circuits and Intelligent Information Processing, Chongqing, 400715, China.
- Man-Fai Leung
- School of Computing and Information Science, Faculty of Science and Engineering, Anglia Ruskin University, Cambridge, UK.
- Shiping Wen
- Faculty of Engineering and Information Technology, Australian Artificial Intelligence Institute, University of Technology Sydney, Sydney, NSW 2007, Australia.
6
Wang J, Lai C, Wang Y, Zhang W. EMAT: Efficient feature fusion network for visual tracking via optimized multi-head attention. Neural Netw 2024; 172:106110. PMID: 38237443. DOI: 10.1016/j.neunet.2024.106110.
Abstract
Transformer-based tracking methods have shown great potential in visual tracking and achieved significant performance. A traditional Transformer-based feature fusion network divides the whole feature map into multiple image patches as its inputs and then processes them directly in parallel, which occupies substantial computing resources and degrades the efficiency of multi-head attention. In this paper, we design a novel feature fusion network with optimized multi-head attention in a Transformer-based encoder-decoder architecture. The designed network preprocesses the input features and changes the calculation of multi-head attention by using an efficient multi-head self-attention module and an efficient multi-head spatial-reduction attention module. The two modules reduce the influence of irrelevant background information, enhance the representation of template and search-region features, and greatly reduce the computational complexity. Based on the designed feature fusion network, we propose a novel Transformer tracking method named EMAT. The proposed EMAT is evaluated on seven challenging tracking benchmarks to demonstrate its superiority: LaSOT, GOT-10k, TrackingNet, UAV123, VOT2018, NfS, and VOT-RGBT2019. The proposed tracker achieves strong tracking performance, obtaining a precision score of 89.0% on UAV123, an AUC score of 64.6% on LaSOT, and an EAO score of 34.8% on VOT-RGBT2019, which outperforms most advanced trackers. EMAT runs at a real-time speed of about 35 FPS during tracking.
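To illustrate spatial-reduction attention in general (PVT-style; names and the reduction ratio are illustrative assumptions, not EMAT's actual code), the sketch below downsamples keys and values before attention, which shrinks the attention matrix and lowers the computational cost:

```python
# A minimal PyTorch sketch of spatial-reduction multi-head attention:
# keys/values are spatially downsampled before attention (PVT-style).
import torch
import torch.nn as nn

class SpatialReductionAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8, sr_ratio: int = 2):
        super().__init__()
        # Strided conv reduces the key/value token grid by sr_ratio.
        self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # x: (batch, h*w, dim) token sequence over an h x w feature map.
        b, n, c = x.shape
        kv = x.transpose(1, 2).reshape(b, c, h, w)
        kv = self.sr(kv).flatten(2).transpose(1, 2)  # fewer key/value tokens
        out, _ = self.attn(query=x, key=kv, value=kv)
        return out

tokens = torch.randn(2, 16 * 16, 64)
sra = SpatialReductionAttention(dim=64)
print(sra(tokens, 16, 16).shape)  # torch.Size([2, 256, 64])
```

With sr_ratio = 2 the attention matrix shrinks from 256 x 256 to 256 x 64 in this example, which is the source of the complexity reduction the abstract describes.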
Affiliation(s)
- Jun Wang
- School of Information Engineering, Nanchang Institute of Technology, Nanchang, 330029, China.
- Changwang Lai
- School of Information Engineering, Nanchang Institute of Technology, Nanchang, 330029, China
- Yuanyun Wang
- School of Information Engineering, Nanchang Institute of Technology, Nanchang, 330029, China.
- Wenshuang Zhang
- School of Information Engineering, Nanchang Institute of Technology, Nanchang, 330029, China
7
Hu X, Zhao J, Hui Y, Li S, You S. SiamHSFT: A Siamese Network-Based Tracker with Hierarchical Sparse Fusion and Transformer for UAV Tracking. Sensors (Basel) 2023; 23:8666. PMID: 37960366. PMCID: PMC10648809. DOI: 10.3390/s23218666.
Abstract
Owing to the high maneuverability and hardware limitations of Unmanned Aerial Vehicle (UAV) platforms, tracking targets from UAV views often encounters challenges such as low resolution, fast motion, and background interference, which make it difficult to reconcile performance with efficiency. Based on the Siamese network framework, this paper proposes a novel UAV tracking algorithm, SiamHSFT, aiming to strike a balance between tracking robustness and real-time computation. First, by combining CBAM attention with downward information interaction in the feature enhancement module, the proposed method merges high-level and low-level feature maps to prevent information loss on small targets. Second, the interlaced sparse attention module attends to both long and short spatial intervals within the affinity, thereby enhancing the utilization of global context and prioritizing crucial information in feature extraction. Lastly, the Transformer's encoder is optimized with a modulation enhancement layer, which integrates triplet attention to strengthen inter-layer dependencies and improve target discrimination. Experimental results demonstrate SiamHSFT's excellent performance across diverse datasets, including UAV123, UAV20L, UAV123@10fps, and DTB70. Notably, it performs particularly well in fast-motion and motion-blur scenarios. Meanwhile, it maintains an average tracking speed of 126.7 fps across all datasets, meeting real-time tracking requirements.
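For reference, the CBAM module the paper builds on applies channel attention followed by spatial attention; a minimal PyTorch sketch (the reduction ratio and kernel size are common defaults, assumed here) is:

```python
# A minimal PyTorch sketch of CBAM (channel + spatial attention), the
# standard module this tracker builds on; defaults are assumptions.
import torch
import torch.nn as nn

class CBAM(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels),
        )
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        # Channel attention: shared MLP over avg- and max-pooled descriptors.
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)
        # Spatial attention: conv over channel-wise avg/max maps.
        s = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))

feats = torch.randn(2, 64, 32, 32)
print(CBAM(64)(feats).shape)  # torch.Size([2, 64, 32, 32])
```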
Affiliation(s)
- Xiuhua Hu
- School of Computer Science and Engineering, Xi’an Technological University, Xi’an 710021, China
- State and Provincial Joint Engineering Laboratory of Advanced Network, Monitoring and Control, Xi’an 710021, China
- Jing Zhao
- School of Computer Science and Engineering, Xi’an Technological University, Xi’an 710021, China
- State and Provincial Joint Engineering Laboratory of Advanced Network, Monitoring and Control, Xi’an 710021, China
- Yan Hui
- School of Computer Science and Engineering, Xi’an Technological University, Xi’an 710021, China
- State and Provincial Joint Engineering Laboratory of Advanced Network, Monitoring and Control, Xi’an 710021, China
- Shuang Li
- School of Computer Science and Engineering, Xi’an Technological University, Xi’an 710021, China
- State and Provincial Joint Engineering Laboratory of Advanced Network, Monitoring and Control, Xi’an 710021, China
- Shijie You
- School of Computer Science and Engineering, Xi’an Technological University, Xi’an 710021, China
- State and Provincial Joint Engineering Laboratory of Advanced Network, Monitoring and Control, Xi’an 710021, China
8
Moorthy S, Joo YH. Learning dynamic spatial-temporal regularized correlation filter tracking with response deviation suppression via multi-feature fusion. Neural Netw 2023; 167:360-379. PMID: 37673025. DOI: 10.1016/j.neunet.2023.08.019.
Abstract
Visual object tracking (VOT) for intelligent video surveillance has attracted great attention in the research community, thanks to advances in computer vision and camera technology. Discriminative correlation filter (DCF) trackers in particular have garnered significant interest owing to their high accuracy and low computational cost. Many researchers have introduced spatial and temporal regularization into the DCF framework to obtain a more robust appearance model and further improve tracking performance. However, these algorithms typically set fixed spatial and temporal regularization parameters, which limits their flexibility and adaptability in cluttered and challenging scenarios. To overcome these problems, we propose a new dynamic spatial-temporal regularization for the DCF tracking model that encourages the filter to concentrate on more reliable regions during the training stage. Furthermore, we present a response deviation-suppressed regularization term that encourages temporal consistency and avoids model degradation by suppressing relative response changes between two consecutive frames. Moreover, we introduce a multi-memory tracking framework that exploits various features, with each memory contributing to tracking the target across all frames. Extensive experiments on the OTB-2013, OTB-2015, TC-128, UAV-123, UAVDT, and DTB-70 datasets reveal that the proposed method outperforms many state-of-the-art DCF-based and deep trackers in terms of tracking accuracy and tracking success rate.
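To convey the idea of response deviation suppression in symbols (an illustrative form in the spirit of aberrance-repressed correlation filters, not the paper's exact term):

```latex
% Illustrative response-deviation penalty between consecutive frames;
% a sketch of the general idea, not the paper's exact regularization.
\[
\mathcal{R}_{t} \;=\; \frac{\eta}{2}\,
\bigl\lVert R_{t} \;-\; R_{t-1}[\psi] \bigr\rVert_{2}^{2}
\]
% R_t: response map at frame t; R_{t-1}[psi]: previous response cyclically
% shifted so its peak aligns with the current one; eta: penalty strength.
% Penalizing abrupt changes between consecutive responses discourages the
% aberrant peaks that precede model degradation.
```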
Affiliation(s)
- Sathishkumar Moorthy
- School of IT Information and Control Engineering, Kunsan National University, 558 Daehak-ro, Gunsan-si, Jeonbuk 54150, Republic of Korea
- Young Hoon Joo
- School of IT Information and Control Engineering, Kunsan National University, 558 Daehak-ro, Gunsan-si, Jeonbuk 54150, Republic of Korea.