1. Li B, Peng F, Hui T, Wei X, Wei X, Zhang L, Shi H, Liu S. RGB-T Tracking With Template-Bridged Search Interaction and Target-Preserved Template Updating. IEEE Transactions on Pattern Analysis and Machine Intelligence 2025;47:634-649. PMID: 39374289. DOI: 10.1109/tpami.2024.3475472.
Abstract
The goal of RGB-Thermal (RGB-T) tracking is to utilize the synergistic and complementary strengths of RGB and TIR modalities to enhance tracking in diverse situations, with cross-modal interaction being a crucial element. Earlier methods often simply combine the features of the RGB and TIR search frames, leading to a coarse interaction that also introduces unnecessary background noise. Many other approaches sample candidate boxes from search frames and apply different fusion techniques to individual pairs of RGB and TIR boxes, which confines cross-modal interactions to local areas and results in insufficient context modeling. Mining video temporal contexts is also under-explored in RGB-T tracking. To alleviate these limitations, we propose a novel Template-Bridged Search region Interaction (TBSI) module that exploits templates as the medium to bridge the cross-modal interaction between RGB and TIR search regions by gathering and distributing target-relevant object and environment contexts. An Illumination Guided Fusion (IGF) module is designed to adaptively fuse RGB and TIR search region tokens with a global illumination factor. Furthermore, in the inference stage, we propose an efficient Target-Preserved Template Updating (TPTU) strategy that leverages the temporal context within video sequences to accommodate the target's appearance changes. Our proposed modules are integrated into a ViT backbone for joint feature extraction, search-template matching, and cross-modal interaction. Extensive experiments on three popular RGB-T tracking benchmarks demonstrate that our method achieves new state-of-the-art performance.
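
The gather-and-distribute interaction at the heart of TBSI can be pictured as two cross-attention steps: template tokens first collect context from one modality's search region, then the other modality's search tokens read from the enriched template. The sketch below is a minimal illustration under assumed token layouts and dimensions, not the authors' implementation:

```python
import torch
import torch.nn as nn

class TemplateBridgedInteraction(nn.Module):
    """Sketch: template tokens gather target-relevant context from the RGB
    search region, then distribute it to the TIR search region (a full
    model would also run the symmetric direction)."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.gather = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.distribute = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, template, search_rgb, search_tir):
        # Gather: template queries attend over RGB search tokens.
        bridged, _ = self.gather(template, search_rgb, search_rgb)
        template = template + bridged
        # Distribute: TIR search tokens query the enriched template.
        enriched, _ = self.distribute(search_tir, template, template)
        return search_tir + enriched

module = TemplateBridgedInteraction()
z = torch.randn(2, 64, 256)        # template tokens (hypothetical sizes)
x_rgb = torch.randn(2, 256, 256)   # RGB search tokens
x_tir = torch.randn(2, 256, 256)   # TIR search tokens
print(module(z, x_rgb, x_tir).shape)  # torch.Size([2, 256, 256])
```

Using the template as the bridge keeps the interaction focused on target-relevant context rather than mixing full search regions, which is the motivation stated in the abstract.
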
2. Huang L, Dong B, Lu J, Zhang W. Mild Policy Evaluation for Offline Actor-Critic. IEEE Transactions on Neural Networks and Learning Systems 2024;35:17950-17964. PMID: 37676802. DOI: 10.1109/tnnls.2023.3309906.
Abstract
In offline actor-critic (AC) algorithms, the distributional shift between the training data and the target policy causes optimistic value estimates for out-of-distribution (OOD) actions. This leads to learned policies skewed toward OOD actions with falsely high values. Existing value-regularized offline AC algorithms address this issue by learning a conservative value function, which leads to a performance drop. In this article, we propose mild policy evaluation (MPE), which constrains the difference between the values of actions supported by the target policy and the values of actions contained in the offline dataset. We analyze the convergence of MPE, the gap between the learned value function and the true one, and the suboptimality of offline AC with MPE. A mild offline AC (MOAC) algorithm is developed by integrating MPE into off-policy AC. Compared with existing offline AC algorithms, the value function gap of MOAC remains bounded in the presence of sampling errors; in their absence, the true state value function can be recovered. Experimental results on the D4RL benchmark demonstrate the effectiveness of MPE and the performance superiority of MOAC over state-of-the-art offline reinforcement learning (RL) algorithms.
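
In spirit, a mild constraint of this kind can be written as a penalty on the value gap between policy actions and dataset actions, added on top of the usual TD loss. The sketch below only illustrates that spirit; the `q_net`/`policy` callables, the ReLU penalty form, and all coefficients are assumptions, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def mild_critic_loss(q_net, q_target, policy, batch, gamma=0.99, alpha=1.0):
    """TD loss plus a mild penalty on the gap between the values of policy
    actions and the values of dataset actions at the same states."""
    s, a, r, s2, done = batch
    with torch.no_grad():
        td_target = r + gamma * (1.0 - done) * q_target(s2, policy(s2))
    td_loss = F.mse_loss(q_net(s, a), td_target)
    # Constrain Q(s, pi(s)) not to exceed Q(s, a_data) by much, instead of
    # uniformly depressing all value estimates as harsher regularizers do.
    gap = q_net(s, policy(s)) - q_net(s, a)
    return td_loss + alpha * F.relu(gap).mean()
```

The key design point is that the penalty acts only on the *difference* between policy-action and dataset-action values, so the critic is not forced to be globally conservative.
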
3. Liu H, Ye M, Wang Y, Zhao S, Li P, Shen J. A New Framework of Collaborative Learning for Adaptive Metric Distillation. IEEE Transactions on Neural Networks and Learning Systems 2024;35:8266-8277. PMID: 37022854. DOI: 10.1109/tnnls.2022.3226569.
Abstract
This article presents a new adaptive metric distillation approach that significantly improves the student network's backbone features, along with better classification results. Previous knowledge distillation (KD) methods usually focus on transferring knowledge across classifier logits or feature structure, overlooking the rich sample relations in the feature space. We demonstrate that such a design greatly limits performance, especially for the retrieval task. The proposed collaborative adaptive metric distillation (CAMD) has three main advantages: 1) it focuses optimization on the relationships between key pairs by introducing a hard mining strategy into the distillation framework; 2) it provides an adaptive metric distillation that explicitly optimizes the student feature embeddings by applying the relations in the teacher embeddings as supervision; and 3) it employs a collaborative scheme for effective knowledge aggregation. Extensive experiments demonstrate that our approach sets a new state-of-the-art in both classification and retrieval tasks, outperforming other cutting-edge distillers under various settings.
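
One way to picture relation-based distillation with hard mining: supervise the student's pairwise distance matrix with the teacher's, restricted to each anchor's most confusable pairs. The sketch below makes assumptions about the batch layout, the choice of `k`, and the loss form; it is illustrative rather than the paper's exact CAMD objective:

```python
import torch
import torch.nn.functional as F

def relation_distillation_loss(student_emb, teacher_emb, k=8):
    """Match student pairwise distances to teacher pairwise distances,
    keeping only the hardest (most confusable) pairs per anchor.
    Assumes a batch of n > k embeddings, one row per sample."""
    s = F.normalize(student_emb, dim=1)
    t = F.normalize(teacher_emb, dim=1)
    d_s = torch.cdist(s, s)   # student relations
    d_t = torch.cdist(t, t)   # teacher relations used as supervision
    n = d_t.size(0)
    # Hard mining: per anchor, select the k nearest other samples under the
    # teacher metric; these pairs carry the most discriminative signal.
    masked = d_t + torch.eye(n, device=d_t.device) * 1e6
    idx = masked.topk(k, dim=1, largest=False).indices
    return F.mse_loss(d_s.gather(1, idx), d_t.gather(1, idx))
```
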
4. Chen J, Huang G, Yuan X, Zhong G, Zheng Z, Pun CM, Zhu J, Huang Z. Quaternion Cross-Modality Spatial Learning for Multi-Modal Medical Image Segmentation. IEEE Journal of Biomedical and Health Informatics 2024;28:1412-1423. PMID: 38145537. DOI: 10.1109/jbhi.2023.3346529.
Abstract
Recently, deep neural networks (DNNs) have had a large impact on image processing, including medical image segmentation, and real-valued convolutions have been extensively utilized in multi-modal medical image segmentation to accurately segment lesions by learning from data. However, the weighted summation in such convolutions limits the ability to maintain the spatial dependence that is crucial for identifying different lesion distributions. In this paper, we propose a novel Quaternion Cross-modality Spatial Learning (Q-CSL) method which explores spatial information while considering the linkage between multi-modal images. Specifically, we introduce quaternions to represent data and coordinates that contain spatial information. Additionally, we propose a Quaternion Spatial-association Convolution to learn the spatial information. Subsequently, the proposed De-level Quaternion Cross-modality Fusion (De-QCF) module excavates inner-space features and fuses cross-modality spatial dependencies. Our experimental results demonstrate that our approach performs well compared with competitive methods while using only 0.01061M parameters and 9.95G FLOPs.
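
The contrast with real-valued convolution is easiest to see in a toy quaternion-valued layer: the Hamilton product ties the four channel components together instead of summing them independently, which is what preserves inter-component (spatial) structure. The sketch assumes features are packed as four equal channel groups; the paper's Quaternion Spatial-association Convolution is more elaborate:

```python
import torch
import torch.nn as nn

class QuaternionLinear(nn.Module):
    """Sketch of a quaternion-valued layer: the last dimension is split
    into four components (r, i, j, k) and mixed via the Hamilton product
    of the input quaternion with a weight quaternion."""
    def __init__(self, in_q, out_q):
        super().__init__()
        self.r = nn.Parameter(torch.randn(out_q, in_q) * 0.02)
        self.i = nn.Parameter(torch.randn(out_q, in_q) * 0.02)
        self.j = nn.Parameter(torch.randn(out_q, in_q) * 0.02)
        self.k = nn.Parameter(torch.randn(out_q, in_q) * 0.02)

    def forward(self, x):
        r, i, j, k = x.chunk(4, dim=-1)  # (..., in_q) each
        # Hamilton product x * w: each output component mixes all four
        # input components with fixed signs, unlike independent summation.
        out_r = r @ self.r.t() - i @ self.i.t() - j @ self.j.t() - k @ self.k.t()
        out_i = r @ self.i.t() + i @ self.r.t() + j @ self.k.t() - k @ self.j.t()
        out_j = r @ self.j.t() - i @ self.k.t() + j @ self.r.t() + k @ self.i.t()
        out_k = r @ self.k.t() + i @ self.j.t() - j @ self.i.t() + k @ self.r.t()
        return torch.cat([out_r, out_i, out_j, out_k], dim=-1)
```

A side benefit visible in the sketch: four shared weight matrices replace one full-size matrix, which is consistent with the very small parameter count the abstract reports.
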
5. Dai W, Liu R, Wu T, Wang M, Yin J, Liu J. Deeply Supervised Skin Lesions Diagnosis With Stage and Branch Attention. IEEE Journal of Biomedical and Health Informatics 2024;28:719-729. PMID: 37624725. DOI: 10.1109/jbhi.2023.3308697.
Abstract
Accurate and unbiased examination of skin lesions is critical for the early diagnosis and treatment of skin diseases. Visual features of skin lesions vary significantly because the images are collected from patients with different lesion colours and morphologies using dissimilar imaging equipment. Recent studies have reported that ensembled convolutional neural networks (CNNs) are practical for classifying such images for the early diagnosis of skin disorders. However, the practical use of ensembled CNNs is limited, as these networks are heavyweight and inadequate for processing contextual information. Although lightweight networks (e.g., MobileNetV3 and EfficientNet) were developed to reduce parameters for implementing deep neural networks on mobile devices, insufficient depth of feature representation restricts their performance. To address these limitations, we develop a new lightweight and effective neural network, HierAttn. HierAttn applies a novel deep supervision strategy to learn local and global features through multi-stage and multi-branch attention mechanisms with only one training loss. Its efficacy was evaluated on the dermoscopy image dataset ISIC2019 and the smartphone photo dataset PAD-UFES-20 (PAD2020). The experimental results show that HierAttn achieves the best accuracy and area under the curve (AUC) among state-of-the-art lightweight networks.
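
A deep supervision scheme that still trains with a single loss can be sketched as stage-wise classification heads whose logits are fused by learnable attention weights before one loss is computed. The layout and dimensions below are assumptions, not HierAttn's actual architecture:

```python
import torch
import torch.nn as nn

class DeeplySupervisedHead(nn.Module):
    """Sketch: classify from several backbone stages, fuse the stage
    logits with learnable attention weights, train with one loss."""
    def __init__(self, stage_dims=(64, 128, 256), num_classes=8):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(d, num_classes) for d in stage_dims)
        self.stage_weights = nn.Parameter(torch.zeros(len(stage_dims)))

    def forward(self, stage_feats):
        # stage_feats: list of (B, C_s) pooled features, one per stage.
        logits = torch.stack([h(f) for h, f in zip(self.heads, stage_feats)])
        w = torch.softmax(self.stage_weights, dim=0)   # attention over stages
        return (w[:, None, None] * logits).sum(0)      # fused logits
```

Training the fused logits with a single nn.CrossEntropyLoss still sends gradient into every stage head, so each stage is deeply supervised without maintaining multiple losses.
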
6. Pei G, Shen F, Yao Y, Chen T, Hua XS, Shen HT. Hierarchical Graph Pattern Understanding for Zero-Shot Video Object Segmentation. IEEE Transactions on Image Processing 2023;32:5909-5920. PMID: 37883290. DOI: 10.1109/tip.2023.3326395.
Abstract
The optical flow guidance strategy is ideal for obtaining motion information of objects in a video and is widely utilized in video segmentation tasks. However, existing optical flow-based methods depend heavily on optical flow, which results in poor performance when the flow estimation fails for a particular scene. The temporal consistency provided by optical flow can be effectively supplemented by modeling it in a structural form. This paper proposes a new hierarchical graph neural network (GNN) architecture, dubbed hierarchical graph pattern understanding (HGPU), for zero-shot video object segmentation (ZS-VOS). Inspired by the strong ability of GNNs to capture structural relations, HGPU leverages motion cues (i.e., optical flow) to enhance the high-order representations from the neighbors of target frames. Specifically, a hierarchical graph pattern encoder with message aggregation is introduced to acquire different levels of motion and appearance features in a sequential manner. Furthermore, a decoder is designed to hierarchically parse and understand the transformed multi-modal contexts for more accurate and robust results. HGPU achieves state-of-the-art performance on four publicly available benchmarks (DAVIS-16, YouTube-Objects, Long-Videos and DAVIS-17). Code and a pre-trained model can be found at https://github.com/NUST-Machine-Intelligence-Laboratory/HGPU.
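
One round of message aggregation over a frame graph, where nodes hold appearance features and messages carry motion (optical-flow) cues, might look like the following. This is a generic message-passing sketch with assumed shapes, not HGPU's encoder:

```python
import torch
import torch.nn as nn

class MotionAppearanceMessageLayer(nn.Module):
    """Sketch: one message-passing step where each neighbour's message
    combines its appearance features with its motion (flow) features."""
    def __init__(self, dim=128):
        super().__init__()
        self.msg = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())
        self.upd = nn.GRUCell(dim, dim)

    def forward(self, node_feats, motion_feats, adj):
        # node_feats, motion_feats: (N, dim); adj: (N, N) float 0/1 matrix.
        m = self.msg(torch.cat([node_feats, motion_feats], dim=1))  # (N, dim)
        agg = adj @ m / adj.sum(1, keepdim=True).clamp(min=1)       # mean over neighbours
        return self.upd(agg, node_feats)                            # updated node states
```

Because the update mixes aggregated neighbour context back into each node's own state, the representation degrades gracefully when the motion features are unreliable, which is the robustness argument the abstract makes against pure flow guidance.
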
7. Zhao J, Dai K, Zhang P, Wang D, Lu H. Robust Online Tracking With Meta-Updater. IEEE Transactions on Pattern Analysis and Machine Intelligence 2023;45:6168-6182. PMID: 36040937. DOI: 10.1109/tpami.2022.3202785.
Abstract
In a sequence, the appearance of both the target and the background often changes dramatically. Offline-trained models may not handle large appearance variations well, causing tracking failures. Most discriminative trackers address this issue by introducing an online update scheme, making the model dynamically adapt to changes of the target and background. Although the online update scheme plays an important role in improving a tracker's accuracy, it inevitably pollutes the model with noisy observation samples, so reducing its risk is necessary for better tracking. In this work, we propose a novel offline-trained Meta-Updater to address an important but unsolved problem: is the tracker ready for updating in the current frame? The proposed module effectively integrates geometric, discriminative, and appearance cues in a sequential manner and then mines the sequential information with a designed cascaded LSTM module. Moreover, we strengthen the effect of appearance information on the module by integrating an additional local outlier factor into a newly designed network. We integrate our meta-updater into eight different types of online-update trackers. Extensive experiments on four long-term and two short-term tracking benchmarks demonstrate that our meta-updater is effective and generalizes well.
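
The question "is the tracker ready for updating?" reduces to a binary classifier over a short history of per-frame cues. A minimal sketch with assumed cue dimensions; the paper's cascaded LSTM and cue design are richer:

```python
import torch
import torch.nn as nn

class MetaUpdater(nn.Module):
    """Sketch: run fused per-frame cues (box geometry, response score,
    appearance similarity, local outlier factor, ...) through an LSTM
    and emit an update/skip decision for the current frame."""
    def __init__(self, cue_dim=16, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(cue_dim, hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, 2)   # logits: [skip update, do update]

    def forward(self, cue_seq):
        # cue_seq: (B, T, cue_dim), one cue vector per past frame.
        out, _ = self.lstm(cue_seq)
        return self.head(out[:, -1])       # decision from the last time step
```

Because the decision module only consumes cue vectors, it is tracker-agnostic, which matches the abstract's claim of plugging into eight different online-update trackers.
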
8. Javed S, Mahmood A, Qaiser T, Werghi N. Knowledge Distillation in Histology Landscape by Multi-Layer Features Supervision. IEEE Journal of Biomedical and Health Informatics 2023;27:2037-2046. PMID: 37021915. DOI: 10.1109/jbhi.2023.3237749.
Abstract
Automatic tissue classification is a fundamental task in computational pathology for profiling tumor micro-environments. Deep learning has advanced tissue classification performance at the cost of significant computational power. Shallow networks have also been trained end-to-end using direct supervision; however, their performance degrades because they fail to capture robust tissue heterogeneity. Knowledge distillation has recently been employed to improve the performance of shallow networks used as student networks by adding supervision from deep neural networks used as teacher networks. In the current work, we propose a novel knowledge distillation algorithm to improve the performance of shallow networks for tissue phenotyping in histology images. For this purpose, we propose multi-layer feature distillation such that a single layer in the student network receives supervision from multiple teacher layers. In the proposed algorithm, the feature maps of the two layers are matched in size using a learnable multi-layer perceptron, and the distance between the feature maps is minimized during the training of the student network. The overall objective is the sum of the losses over multiple layer combinations, weighted by a learnable attention-based parameter. The proposed algorithm is named Knowledge Distillation for Tissue Phenotyping (KDTP). Experiments are performed on five publicly available histology image classification datasets using several teacher-student network combinations within the KDTP algorithm. Our results demonstrate a significant performance increase in the student networks with KDTP compared to direct supervision-based training.
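
The core mechanism, one student layer supervised by several teacher layers through learnable size-matching MLPs with an attention-weighted sum of per-layer distances, can be sketched as follows (all dimensions hypothetical):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiLayerDistillLoss(nn.Module):
    """Sketch: one student layer gets supervision from several teacher
    layers; learnable MLPs match feature sizes and a learnable attention
    vector weights the per-layer losses."""
    def __init__(self, student_dim=128, teacher_dims=(256, 512, 1024)):
        super().__init__()
        self.proj = nn.ModuleList(
            nn.Sequential(nn.Linear(d, student_dim), nn.ReLU(),
                          nn.Linear(student_dim, student_dim))
            for d in teacher_dims)
        self.attn = nn.Parameter(torch.zeros(len(teacher_dims)))

    def forward(self, student_feat, teacher_feats):
        # student_feat: (B, student_dim); teacher_feats: list of (B, d_t).
        w = torch.softmax(self.attn, dim=0)
        losses = [F.mse_loss(student_feat, p(t))
                  for p, t in zip(self.proj, teacher_feats)]
        return sum(wi * li for wi, li in zip(w, losses))
```

The softmax over `self.attn` lets training decide which teacher layers are most useful for the chosen student layer rather than fixing the combination by hand.
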
9. Yang Y, Gu X. Joint Correlation and Attention Based Feature Fusion Network for Accurate Visual Tracking. IEEE Transactions on Image Processing 2023;32:1705-1715. PMID: 37028050. DOI: 10.1109/tip.2023.3251027.
Abstract
Correlation operation and the attention mechanism are two popular feature fusion approaches that play an important role in visual object tracking. However, correlation-based tracking networks are sensitive to location information but lose some contextual semantics, while attention-based tracking networks make full use of rich semantic information but ignore the position distribution of the tracked object. Therefore, in this paper, we propose a novel tracking framework based on joint correlation and attention networks, termed JCAT, which effectively combines the advantages of these two complementary feature fusion approaches. Concretely, JCAT adopts parallel correlation and attention branches to generate position and semantic features. The fused features are then obtained by directly adding the positional and semantic features, and are fed into the segmentation network to generate a pixel-wise state estimation of the object. Furthermore, we develop a segmentation memory bank and an online sample filtering mechanism for robust segmentation and tracking. Extensive experimental results on eight challenging visual tracking benchmarks show that the proposed JCAT tracker achieves very promising tracking performance and sets a new state-of-the-art on the VOT2018 benchmark.
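
The parallel-branch design can be sketched as a depthwise cross-correlation branch and a cross-attention branch fused by direct addition. All shapes and the exact correlation form below are assumptions for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointCorrAttnFusion(nn.Module):
    """Sketch: a correlation branch (position-sensitive) and an attention
    branch (semantics-rich) run in parallel; outputs are added."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, template, search):
        # template: (B, C, Ht, Wt) with odd Ht, Wt; search: (B, C, Hs, Ws)
        b, c, hs, ws = search.shape
        kh, kw = template.shape[-2:]
        # Correlation branch: per-sample, per-channel cross-correlation
        # using the template as the convolution kernel (grouped conv trick).
        corr = F.conv2d(search.reshape(1, b * c, hs, ws),
                        template.reshape(b * c, 1, kh, kw),
                        groups=b * c, padding=(kh // 2, kw // 2))
        corr = corr.reshape(b, c, hs, ws)
        # Attention branch: search tokens attend to template tokens.
        q = search.flatten(2).transpose(1, 2)      # (B, Hs*Ws, C)
        kv = template.flatten(2).transpose(1, 2)   # (B, Ht*Wt, C)
        sem, _ = self.attn(q, kv, kv)
        sem = sem.transpose(1, 2).reshape(b, c, hs, ws)
        return corr + sem                          # fusion by direct addition
```

Addition is the simplest fusion that keeps both signals at every spatial location, which is exactly what the abstract describes.
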
10. Xu T, Feng Z, Wu XJ, Kittler J. Toward Robust Visual Object Tracking With Independent Target-Agnostic Detection and Effective Siamese Cross-Task Interaction. IEEE Transactions on Image Processing 2023;32:1541-1554. PMID: 37027596. DOI: 10.1109/tip.2023.3246800.
Abstract
Advanced Siamese visual object tracking architectures are jointly trained using pair-wise input images to perform target classification and bounding box regression. They have achieved promising results in recent benchmarks and competitions. However, the existing methods suffer from two limitations. First, though the Siamese structure can estimate the target state in an instance frame, provided the target appearance does not deviate too much from the template, the detection of the target in an image cannot be guaranteed in the presence of severe appearance variations. Second, despite the classification and regression tasks sharing the same output from the backbone network, their specific modules and loss functions are invariably designed independently, without promoting any interaction. Yet, in a general tracking task, the centre classification and bounding box regression tasks work collaboratively to estimate the final target location. To address the above issues, it is essential to perform target-agnostic detection so as to promote cross-task interactions in a Siamese-based tracking framework. In this work, we endow a novel network with a target-agnostic object detection module to complement the direct target inference and to avoid or minimise the misalignment of the key cues of potential template-instance matches. To unify the multi-task learning formulation, we develop a cross-task interaction module to ensure consistent supervision of the classification and regression branches, improving the synergy of the different branches. To eliminate potential inconsistencies that may arise within a multi-task architecture, we assign adaptive labels, rather than fixed hard labels, to supervise the network training more effectively. The experimental results obtained on several benchmarks, i.e., OTB100, UAV123, VOT2018, VOT2019, and LaSOT, demonstrate the effectiveness of the target-agnostic detection module and the cross-task interaction, exhibiting superior tracking performance compared with state-of-the-art tracking methods.
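
The adaptive-label idea, replacing fixed hard labels with graded targets, can be illustrated with a simple IoU-based soft labeling rule. The thresholds and the linear ramp below are hypothetical, not the paper's scheme:

```python
import torch

def adaptive_labels(ious, low=0.3, high=0.7):
    """Sketch: graded classification targets from anchor-to-ground-truth
    IoU: 0 below `low`, 1 above `high`, linear in between. Soft targets
    pair naturally with a BCE-style classification loss."""
    return ((ious - low) / (high - low)).clamp(0.0, 1.0)

print(adaptive_labels(torch.tensor([0.2, 0.5, 0.9])))  # -> 0.0, 0.5, 1.0
```

Graded targets keep the classification branch's supervision consistent with the regression branch's notion of localization quality, easing the multi-task inconsistency the abstract describes.
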
11. Liang T, Li B, Wang M, Tan H, Luo Z. A Closer Look at the Joint Training of Object Detection and Re-Identification in Multi-Object Tracking. IEEE Transactions on Image Processing 2022;32:267-280. PMID: 37015359. DOI: 10.1109/tip.2022.3227814.
Abstract
Unifying object detection and re-identification (ReID) in a single network enables faster multi-object tracking (MOT), but this multi-task setting poses challenges for training. In this work, we dissect the joint training of detection and ReID along two dimensions: label assignment and loss function. We find that previous works generally overlook these and directly borrow practices from object detection, inevitably causing inferior performance. Specifically, we identify that a qualified label assignment for MOT should: 1) make the assignment cost aware of the ReID cost, not just the detection cost; and 2) provide sufficient positive samples for robust feature learning while avoiding ambiguous positives (i.e., positives shared by different ground-truth objects). To achieve these goals, we first propose Identity-aware Label Assignment, which jointly considers the assignment cost of detection and ReID to select positive samples for each instance without ambiguity. Moreover, we advance a novel Discriminative Focal Loss that integrates ReID predictions with the Focal Loss to focus training on discriminative samples. Finally, we upgrade the strong baseline FairMOT with our techniques and achieve up to 7.0 MOTA / 54.1% IDs improvements on the MOT16/17/20 benchmarks at favorable inference speed, verifying that our label assignment and loss function tailored for MOT are superior to those inherited from object detection.
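
A sketch of how an identity-aware assignment might select unambiguous positives from a combined detection-plus-ReID cost matrix; the additive cost combination and the choice of `k` are illustrative assumptions, not the paper's exact procedure:

```python
import torch

def identity_aware_assignment(det_cost, reid_cost, k=5):
    """Sketch: combine detection and ReID costs, take the k lowest-cost
    anchors per ground-truth object, then drop anchors claimed by more
    than one object (the ambiguous positives)."""
    # det_cost, reid_cost: (num_gt, num_anchors) cost matrices.
    cost = det_cost + reid_cost
    topk = cost.topk(k, dim=1, largest=False).indices   # (num_gt, k)
    pos = torch.zeros_like(cost, dtype=torch.bool)
    pos.scatter_(1, topk, True)
    ambiguous = pos.sum(0) > 1        # anchor selected by multiple objects
    pos[:, ambiguous] = False
    return pos                        # (num_gt, num_anchors) positive mask
```

Folding the ReID cost into the selection is the key move: an anchor that localizes well but yields a confusable embedding stops being an attractive positive.
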
12. Wang X, Chen Z, Jiang B, Tang J, Luo B, Tao D. Beyond Greedy Search: Tracking by Multi-Agent Reinforcement Learning-Based Beam Search. IEEE Transactions on Image Processing 2022;31:6239-6254. PMID: 36166563. DOI: 10.1109/tip.2022.3208437.
Abstract
To track the target in a video, current visual trackers usually adopt greedy search for target object localization in each frame; that is, the candidate region with the maximum response score is selected as the tracking result of each frame. However, we found that this may not be an optimal choice, especially in challenging tracking scenarios such as heavy occlusion and fast motion. In particular, if a tracker drifts, errors accumulate and make the response scores estimated by the tracker unreliable in future frames. To address this issue, we propose to maintain multiple tracking trajectories and apply a beam search strategy for visual tracking, so that the trajectory with fewer accumulated errors can be identified. Accordingly, this paper introduces a novel multi-agent reinforcement learning based beam search tracking strategy, termed BeamTracking. It is mainly inspired by the image captioning task, which takes an image as input and generates diverse descriptions using a beam search algorithm. We formulate tracking as a sample selection problem fulfilled by multiple parallel decision-making processes, each of which aims to pick out one sample as its tracking result in each frame. Each maintained trajectory is associated with an agent that performs the decision-making and determines what actions should be taken to update related information. More specifically, using a classification-based tracker as the baseline, we first adopt a bi-GRU to encode the target feature, proposal feature, and response score into a unified state representation. The state feature and the greedy search result are then fed into the first agent for independent action selection. Afterwards, the output action and state features are fed into subsequent agents for diverse result prediction. When all frames are processed, we select the trajectory with the maximum accumulated score as the tracking result. Extensive experiments on seven popular tracking benchmark datasets validate the effectiveness of the proposed algorithm.
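
Stripped of the reinforcement learning machinery, the trajectory-level idea is classic beam search over per-frame candidates: keep several partial trajectories ranked by accumulated score instead of committing greedily each frame. A self-contained sketch:

```python
def beam_search_tracking(frame_candidates, beam_width=3):
    """Sketch of beam-search target localization. `frame_candidates` is a
    list with one entry per frame, each a list of (box, score) pairs
    standing in for the tracker's candidate responses."""
    beams = [([], 0.0)]  # (trajectory of boxes, accumulated score)
    for candidates in frame_candidates:
        expanded = [(traj + [box], total + score)
                    for traj, total in beams
                    for box, score in candidates]
        expanded.sort(key=lambda b: b[1], reverse=True)
        beams = expanded[:beam_width]  # keep the best `beam_width` trajectories
    return beams[0]  # best trajectory and its accumulated score

frames = [[((10, 10, 40, 40), 0.9), ((50, 50, 80, 80), 0.8)],
          [((12, 11, 42, 41), 0.2), ((52, 51, 82, 81), 0.7)]]
print(beam_search_tracking(frames, beam_width=2))
```

A greedy tracker locked into the 0.9 candidate in frame one cannot recover if that branch degrades; the beam keeps the 0.8 branch alive as a fallback, which is the drift-recovery behaviour the abstract targets.
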
13. Gao Y, Xu H, Zheng Y, Li J, Gao X. An Object Point Set Inductive Tracker for Multi-Object Tracking and Segmentation. IEEE Transactions on Image Processing 2022;31:6083-6096. PMID: 36074868. DOI: 10.1109/tip.2022.3203607.
Abstract
Multi-object tracking and segmentation (MOTS) is a derivative task of multi-object tracking (MOT) whose setting encourages learning more discriminative, high-quality embeddings. In this paper, we focus on exploring the relationship between the segmenter and the tracker, and propose an efficient Object Point set Inductive Tracker (OPITrack) based on it. First, we observe that after a single attention layer, the high-dimensional key point embeddings exhibit feature averaging. To alleviate this phenomenon, we propose an embedding generalization training strategy of sparse training and dense testing. This strategy increases randomness during training and encourages the tracker to learn more discriminative features. In addition, to learn the desired embedding space, we propose a general Trip-hard sample augmentation loss, which brings patches that the segmenter cannot distinguish into feature learning and forces the embedding network to learn the difference between false positives and true positives. Our method was validated on two MOTS benchmark datasets and achieved promising results. Moreover, OPITrack achieves better performance than the raw model while consuming less video memory (VRAM) at training time.
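
A triplet-style loss that drafts segmenter-confused patches as hard negatives can be sketched as follows; the per-anchor set layout and the margin are assumptions, not the paper's exact Trip-hard formulation:

```python
import torch
import torch.nn.functional as F

def trip_hard_loss(anchor, positives, hard_negatives, margin=0.3):
    """Sketch: for each anchor embedding, take its farthest positive and
    its nearest hard negative (e.g. a patch the segmenter confuses with
    the target) in a triplet margin loss."""
    # anchor: (N, D); positives: (N, P, D); hard_negatives: (N, Q, D)
    a = anchor.unsqueeze(1)                                    # (N, 1, D)
    d_pos = torch.cdist(a, positives).squeeze(1).max(1).values       # hardest positive
    d_neg = torch.cdist(a, hard_negatives).squeeze(1).min(1).values  # hardest negative
    return F.relu(d_pos - d_neg + margin).mean()
```

Feeding the segmenter's false positives in as hard negatives is what forces the embedding space to separate true detections from plausible-looking clutter.
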
14. Wang S, Sheng H, Yang D, Zhang Y, Wu Y, Wang S. Extendable Multiple Nodes Recurrent Tracking Framework With RTU+. IEEE Transactions on Image Processing 2022;31:5257-5271. PMID: 35881604. DOI: 10.1109/tip.2022.3192706.
Abstract
Recently, tracking-by-detection has become a popular paradigm in multiple-object tracking (MOT) for its concise pipeline. Many current works first associate detections to form track proposals and then score the proposals with hand-crafted functions to select the best one. However, long-term tracking information is lost in this way due to detection failures or heavy occlusion. In this paper, the Extendable Multiple Nodes Tracking framework (EMNT) is introduced to model the association. Instead of detections, EMNT creates four basic types of nodes (correct, false, dummy and termination) to model the tracking procedure in a general way. Further, we propose a General Recurrent Tracking Unit (RTU++) to score track proposals by capturing long-term information. In addition, we present an efficient method for generating simulated tracking data to overcome the dilemma of limited available data in MOT. The experiments show that our methods achieve state-of-the-art performance on the MOT17, MOT20 and HiEve benchmarks. Meanwhile, RTU++ can be flexibly plugged into other trackers such as MHT, bringing significant improvements. Additional experiments on MOTS20 and CTMC-v1 also demonstrate the generalization ability of RTU++ trained on simulated data in various scenarios.
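
Scoring a track proposal with a recurrent unit over typed nodes can be sketched as a GRU over node features concatenated with a node-type embedding. All dimensions and the readout below are assumptions, not RTU++ itself:

```python
import torch
import torch.nn as nn

class RecurrentTrackingUnit(nn.Module):
    """Sketch: score a track proposal by running its node sequence through
    a GRU; the four node types (correct, false, dummy, termination) enter
    as learned embeddings alongside per-node features."""
    NUM_NODE_TYPES = 4

    def __init__(self, feat_dim=32, hidden=64):
        super().__init__()
        self.type_emb = nn.Embedding(self.NUM_NODE_TYPES, feat_dim)
        self.gru = nn.GRU(2 * feat_dim, hidden, batch_first=True)
        self.score = nn.Linear(hidden, 1)

    def forward(self, node_feats, node_types):
        # node_feats: (B, T, feat_dim); node_types: (B, T) int64 in [0, 4)
        x = torch.cat([node_feats, self.type_emb(node_types)], dim=-1)
        out, _ = self.gru(x)
        return self.score(out[:, -1]).squeeze(-1)   # one score per proposal
```

Replacing hand-crafted scoring functions with a learned recurrent readout is what lets the scorer retain long-term information across detection failures and occlusions, and a typed-node scorer like this is tracker-agnostic, consistent with plugging it into trackers such as MHT.
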