1. Tan B, Xiao Y, Li S, Tong X, Yan T, Cao Z, Zhou JT. Language-Guided 3-D Action Feature Learning Without Ground-Truth Sample Class Label. IEEE Trans Neural Netw Learn Syst 2025;36:9356-9369. [PMID: 38865228] [DOI: 10.1109/tnnls.2024.3409613]
Abstract
This work makes the first research effort to leverage point cloud sequence-based Self-supervised 3-D Action Feature Learning (S3AFL) under the cross-modality weak supervision of text. We intend to close the large performance gap between point cloud sequence-based and 3-D skeleton-based manners. The key intuition derives from the observation that skeleton-based manners hold high-level knowledge of the human pose, which directs attention to the body's joint-aware local parts. Inspired by this, we propose to introduce the weak supervision of high-level text semantics into the point cloud sequence-based paradigm. Given an RGB-point cloud pair sequence acquired with an RGB-D camera, a text sequence is first generated from the RGB component using a pretrained image captioning model, as auxiliary weak supervision. S3AFL then runs via cross-modality and intra-modality contrastive learning (CL). To counteract missing and redundant semantics in the text, feature learning is conducted in a multistage way with semantic refinement. Essentially, text is required only for training. To strengthen the feature's representation power on fine-grained actions, a multirank max-pooling (MR-MP) scheme is also proposed for the point set network to better retain discriminative clues. Experiments verify that the text's weak supervision improves performance by up to 10.8%, 10.4%, and 8.0% on NTU RGB+D 60, NTU RGB+D 120, and N-UCLA, respectively. The performance gap between point cloud sequence-based and skeleton-based manners is thereby remarkably narrowed. The idea of transferring the text's weak supervision to S3AFL also generalizes well to skeleton-based manners. The source code is available at https://github.com/tangent-T/W3AMT.
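A minimal sketch of what a multirank max-pooling layer could look like, assuming (as one plausible reading of "multirank") that MR-MP keeps the top-r responses per channel for several ranks r instead of only the single maximum; the function name, ranks, and averaging are hypothetical, not the paper's exact design.

```python
import torch

def multirank_max_pool(feats: torch.Tensor, ranks=(1, 2, 4)) -> torch.Tensor:
    """Hypothetical multirank max-pooling (MR-MP) sketch.

    feats: (B, N, C) per-point features from a point set network.
    For each rank r, keep the top-r responses per channel, average
    them, and concatenate the pooled vectors across ranks. Compared
    with plain max-pooling (r = 1), larger ranks retain more of the
    discriminative clues that a single maximum would discard.
    """
    pooled = []
    for r in ranks:
        topr = feats.topk(k=r, dim=1).values   # (B, r, C)
        pooled.append(topr.mean(dim=1))        # (B, C)
    return torch.cat(pooled, dim=-1)           # (B, C * len(ranks))

# Example: 2 sequences, 512 points, 256-D features -> (2, 768)
x = torch.randn(2, 512, 256)
print(multirank_max_pool(x).shape)
```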
2. Wang P, Su F, Zhao Z, Zhao Y, Boulgouris NV. GAReID: Grouped and Attentive High-Order Representation Learning for Person Re-Identification. IEEE Trans Neural Netw Learn Syst 2025;36:3990-4004. [PMID: 36197859] [DOI: 10.1109/tnnls.2022.3209537]
Abstract
As person parts are frequently misaligned between detected human boxes, an image representation that can handle this part misalignment is required. In this work, we propose an effective grouped attentive re-identification (GAReID) framework to learn part-aligned and background-robust representations for person re-identification (ReID). Specifically, the GAReID framework consists of grouped high-order pooling (GHOP) and attentive high-order pooling (AHOP) layers, which generate high-order image and foreground features, respectively. In addition, a novel grouped Kronecker product (GKP) is proposed that uses channel grouping and shuffling strategies for high-order feature compression while promoting the representational capability of the compressed high-order features. We show that our method derives from an interpretable motivation and elegantly reduces part misalignment without using landmark detection or feature partition. This article theoretically and experimentally demonstrates the superiority of the GAReID framework, which achieves state-of-the-art performance on various person ReID datasets.
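To make the compression argument concrete, here is a hedged sketch of grouped second-order pooling with a channel shuffle, in the spirit of GKP: splitting C channels into g groups shrinks the second-order descriptor from C² to g·(C/g)² entries. Shapes and the shuffle placement are illustrative assumptions, not GAReID's exact layer.

```python
import torch

def grouped_kronecker_pool(x: torch.Tensor, groups: int = 8) -> torch.Tensor:
    """Sketch of grouped Kronecker product (GKP)-style compression.

    x: (B, C, H, W) feature map; C must be divisible by `groups`.
    Channels are shuffled and split into groups; a second-order
    (outer-product) statistic is pooled within each group, so the
    descriptor grows as groups * (C/groups)^2 instead of C^2.
    """
    b, c, h, w = x.shape
    d = c // groups
    # channel shuffle so each group mixes channels from across the map
    # (a real layer would register one fixed permutation, not resample)
    perm = torch.randperm(c)
    x = x[:, perm].reshape(b, groups, d, h * w)
    # per-group second-order pooling: (B, groups, d, d)
    gram = torch.einsum('bgij,bgkj->bgik', x, x) / (h * w)
    return gram.flatten(1)  # (B, groups * d * d)

feat = torch.randn(4, 256, 24, 8)
print(grouped_kronecker_pool(feat).shape)  # (4, 8192) with 8 groups
```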
3. Zhu A, Wang Z, Xue J, Wan X, Jin J, Wang T, Snoussi H. Improving Text-Based Person Retrieval by Excavating All-Round Information Beyond Color. IEEE Trans Neural Netw Learn Syst 2025;36:5097-5111. [PMID: 38416620] [DOI: 10.1109/tnnls.2024.3368217]
Abstract
Text-based person retrieval is the task of searching a massive visual resource library for images of a particular pedestrian, based on a textual query. Existing approaches often suffer from over-reliance on color (CLR), which can result in suboptimal retrieval performance by distracting the model from other important visual cues such as texture and structure. To handle this problem, we propose a novel framework to Excavate All-round Information Beyond Color for text-based person retrieval, termed EAIBC. The EAIBC architecture includes four branches, namely an RGB branch, a grayscale (GRS) branch, a high-frequency (HFQ) branch, and a CLR branch. Furthermore, we introduce a mutual learning (ML) mechanism to facilitate communication and learning among the branches, enabling them to exploit all-round information in an effective and balanced manner. We evaluate the proposed method on three benchmark datasets: CUHK-PEDES, ICFG-PEDES, and RSTPReid. The experimental results demonstrate that EAIBC significantly outperforms existing methods and achieves state-of-the-art (SOTA) performance in supervised, weakly supervised, and cross-domain settings.
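The mutual learning mechanism among the four branches admits a compact sketch. Below is a standard deep-mutual-learning term in PyTorch, assuming each branch outputs identity logits and is pulled toward the averaged prediction of its peers; EAIBC's exact formulation may differ, so treat the temperature and averaging as assumptions.

```python
import torch
import torch.nn.functional as F

def mutual_learning_loss(branch_logits: list, T: float = 2.0):
    """Sketch of a mutual learning (ML) term among parallel branches.

    branch_logits: per-branch identity logits, e.g. from the RGB,
    grayscale, high-frequency, and color branches. Each branch is
    pulled toward the averaged prediction of its peers via KL
    divergence, so cues learned by one branch propagate to the others.
    """
    loss = 0.0
    for i, logits in enumerate(branch_logits):
        peers = [l for j, l in enumerate(branch_logits) if j != i]
        target = torch.stack(peers).mean(dim=0).detach()
        loss = loss + F.kl_div(
            F.log_softmax(logits / T, dim=1),
            F.softmax(target / T, dim=1),
            reduction='batchmean') * T * T
    return loss / len(branch_logits)

rgb, grs, hfq, clr = (torch.randn(8, 751) for _ in range(4))  # 751 IDs
print(mutual_learning_loss([rgb, grs, hfq, clr]))
```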
4. Gao J, Huang Z, Lei Y, Shan H, Wang JZ, Wang FY, Zhang J. Deep Rank-Consistent Pyramid Model for Enhanced Crowd Counting. IEEE Trans Neural Netw Learn Syst 2025;36:299-312. [PMID: 38090870] [DOI: 10.1109/tnnls.2023.3336774]
Abstract
Most conventional crowd counting methods use a fully supervised learning framework to establish a mapping between scene images and crowd density maps. They usually rely on a large quantity of costly and time-intensive pixel-level annotations for training supervision. One way to mitigate this intensive labeling effort and improve counting accuracy is to leverage large amounts of unlabeled images, exploiting the inherent self-structural information and rank consistency within a single image to obtain additional qualitative relation supervision during training. In contrast to earlier methods that utilized rank relations at the original image level, we explore such rank-consistency relations within the latent feature spaces. This approach enables the incorporation of numerous pyramid partial orders, strengthening the model's representation capability, and it also increases the utilization ratio of unlabeled samples. Specifically, we propose a Deep Rank-Consistent Pyramid Model (DREAM), which makes full use of rank consistency across coarse-to-fine pyramid features in latent spaces for enhanced crowd counting with massive unlabeled images. In addition, we have collected a new unlabeled crowd counting dataset, FUDAN-UCC, comprising 4000 images for training purposes. Extensive experiments on four benchmark datasets, namely UCF-QNRF, ShanghaiTech PartA and PartB, and UCF-CC-50, show the effectiveness of our method compared with previous semi-supervised methods. The code is available at https://github.com/bridgeqiqi/DREAM.
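The rank-consistency idea on unlabeled images reduces to a simple partial order: a center crop can never contain more people than the region enclosing it. A hedged sketch of such a hinge-style ranking loss follows, applied here to the predicted density map for clarity; DREAM enforces analogous orders on latent pyramid features, and the crop schedule and margin below are illustrative.

```python
import torch
import torch.nn.functional as F

def pyramid_rank_loss(density: torch.Tensor, levels: int = 3,
                      margin: float = 0.0):
    """Rank-consistency loss sketch for an unlabeled image.

    density: (B, 1, H, W) predicted density map. Summed density over
    nested center crops must be non-decreasing with region size, so
    any violation of that partial order is penalized with a hinge.
    """
    b, _, h, w = density.shape
    counts = []
    for lvl in range(levels):
        s = 2 ** lvl  # crop shrinks by half each level
        ch, cw = h // s, w // s
        top, left = (h - ch) // 2, (w - cw) // 2
        crop = density[:, :, top:top + ch, left:left + cw]
        counts.append(crop.sum(dim=(1, 2, 3)))  # (B,) per-crop counts
    loss = 0.0
    for big, small in zip(counts[:-1], counts[1:]):
        # hinge penalty whenever an inner count exceeds the outer one
        loss = loss + F.relu(small - big + margin).mean()
    return loss / (levels - 1)

pred = torch.rand(4, 1, 96, 96)
print(pyramid_rank_loss(pred))
```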
5. Zhu K, Guo H, Zhang S, Wang Y, Liu J, Wang J, Tang M. AAformer: Auto-Aligned Transformer for Person Re-Identification. IEEE Trans Neural Netw Learn Syst 2024;35:17307-17317. [PMID: 37624720] [DOI: 10.1109/tnnls.2023.3301856]
Abstract
In person re-identification (re-ID), extracting part-level features from person images has been verified to be crucial for providing fine-grained information. Most existing CNN-based methods either locate human parts only coarsely or rely on pretrained human parsing models and fail to locate identifiable nonhuman parts (e.g., a knapsack). In this article, we introduce an alignment scheme into the transformer architecture for the first time and propose the auto-aligned transformer (AAformer) to automatically locate both human and nonhuman parts at the patch level. We introduce "part tokens ([PART]s)," which are learnable vectors, to extract part features in the transformer. A [PART] interacts only with a local subset of patches in self-attention and learns to be the part representation. To adaptively group the image patches into different subsets, we design auto-alignment, which employs a fast variant of the optimal transport (OT) algorithm to cluster the patch embeddings online into several groups with the [PART]s as their prototypes. AAformer integrates part alignment into self-attention, and the output [PART]s can be used directly as part features for retrieval. Extensive experiments validate the effectiveness of [PART]s and the superiority of AAformer over various state-of-the-art methods.
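The online clustering step maps naturally onto Sinkhorn-style balanced assignment. A minimal sketch follows, assuming cosine-similarity scores between patch embeddings and [PART] prototypes; iteration count, temperature, and the hard argmax are illustrative choices, not necessarily AAformer's settings.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sinkhorn_assign(patches: torch.Tensor, parts: torch.Tensor,
                    n_iters: int = 3, eps: float = 0.05) -> torch.Tensor:
    """Auto-alignment sketch: cluster patch embeddings around [PART]
    prototypes with a fast optimal-transport solver.

    patches: (N, D) patch embeddings; parts: (K, D) [PART] tokens.
    Sinkhorn iterations balance the soft assignment so every part
    receives roughly N/K patches, preventing collapse onto one token.
    """
    scores = F.normalize(patches, dim=1) @ F.normalize(parts, dim=1).t()
    q = torch.exp(scores / eps)               # (N, K) transport kernel
    for _ in range(n_iters):
        q = q / q.sum(dim=0, keepdim=True)    # columns: one unit per part
        q = q / q.sum(dim=1, keepdim=True)    # rows: one unit per patch
    return q.argmax(dim=1)                    # hard patch-to-part labels

patches, parts = torch.randn(196, 768), torch.randn(4, 768)
print(sinkhorn_assign(patches, parts).bincount(minlength=4))
```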
6. Tan B, Xiao Y, Wang Y, Li S, Yang J, Cao Z, Zhou JT, Yuan J. Beyond Pattern Variance: Unsupervised 3-D Action Representation Learning With Point Cloud Sequence. IEEE Trans Neural Netw Learn Syst 2024;35:18186-18199. [PMID: 37729565] [DOI: 10.1109/tnnls.2023.3312673]
Abstract
This work makes the first research effort to address unsupervised 3-D action representation learning with point cloud sequences, unlike existing unsupervised methods that rely on 3-D skeleton information. Our approach builds on the state-of-the-art 3-D action descriptor, the 3-D dynamic voxel (3DV), with contrastive learning (CL). 3DV compresses a point cloud sequence into a compact point cloud of 3-D motion information, on which spatiotemporal data augmentations are conducted to drive CL. However, we find that existing CL methods (e.g., SimCLR or MoCo v2) often suffer from high pattern variance across the augmented 3DV samples from the same action instance; that is, the augmented 3DV samples remain highly complementary in feature space after CL, while the complementary discriminative clues within them are not well exploited. To address this, a feature augmentation adapted CL (FACL) approach is proposed, which facilitates 3-D action representation by jointly attending to the features from all augmented 3DV samples, in the spirit of feature augmentation. FACL runs in a global-local way: one branch learns a global feature that involves the discriminative clues from the raw and augmented 3DV samples, and the other focuses on enhancing the discriminative power of the local feature learned from each augmented 3DV sample. The global and local features are fused via concatenation to characterize the 3-D action jointly. To fit FACL, a series of spatiotemporal data augmentation approaches is also studied on 3DV. Wide-ranging experiments verify the superiority of our unsupervised method for 3-D action feature learning. It outperforms state-of-the-art skeleton-based counterparts by 6.4% and 3.6% under the cross-setup and cross-subject test settings on NTU RGB+D 120, respectively. The source code is available at https://github.com/tangent-T/FACL.
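The fusion-by-concatenation step plus instance-level CL can be sketched in a few lines. The snippet below concatenates global and local features per view and applies the usual InfoNCE objective between two augmented views; the concatenation follows the abstract, while the loss details are the standard SimCLR formulation rather than FACL's exact objective.

```python
import torch
import torch.nn.functional as F

def fused_info_nce(global_a, local_a, global_b, local_b, tau: float = 0.1):
    """Sketch: fuse global and local branch features by concatenation
    into one 3-D action descriptor per view, then pull the two
    augmented views of each instance together with InfoNCE.
    """
    za = F.normalize(torch.cat([global_a, local_a], dim=1), dim=1)
    zb = F.normalize(torch.cat([global_b, local_b], dim=1), dim=1)
    logits = za @ zb.t() / tau                  # (B, B) view-a vs. view-b
    labels = torch.arange(za.size(0))           # positives on the diagonal
    return F.cross_entropy(logits, labels)

g1, l1, g2, l2 = (torch.randn(16, 128) for _ in range(4))
print(fused_info_nce(g1, l1, g2, l2))
```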
7. Li Y, Liu Y, Zhang H, Zhao C, Wei Z, Miao D. Occlusion-Aware Transformer With Second-Order Attention for Person Re-Identification. IEEE Trans Image Process 2024;33:3200-3211. [PMID: 38687652] [DOI: 10.1109/tip.2024.3393360]
Abstract
Person re-identification (ReID) typically encounters varying degrees of occlusion in real-world scenarios. While previous methods have addressed this using handcrafted partitions or external cues, they often compromise semantic information or increase network complexity. In this paper, we approach the problem from a novel perspective with a method termed OAT. Specifically, we first use a Transformer backbone with multiple class tokens for diverse pedestrian feature learning. Because the self-attention mechanism in the Transformer focuses solely on low-level feature correlations and neglects higher-order relations among different body parts or regions, we propose the Second-Order Attention (SOA) module to capture more comprehensive features. For computational efficiency, we further derive approximate formulations for implementing second-order attention. Observing that the importance of the semantics associated with different class tokens varies with the uncertain location and size of the occlusion, we propose the Entropy Guided Fusion (EGF) module for multiple class tokens. By conducting uncertainty analysis on each class token, higher weights are assigned to tokens with lower information entropy and lower weights to those with higher entropy. This dynamic weight adjustment mitigates the impact of occlusion-induced uncertainty on feature learning, thereby facilitating the acquisition of discriminative class token representations. Extensive experiments on occluded and holistic person re-identification datasets demonstrate the effectiveness of the proposed method.
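The entropy-guided weighting is simple to illustrate. Below is a minimal sketch, assuming each class token has its own identity logits and that the fusion weights come from a softmax over negative entropies; OAT's exact weighting function may differ.

```python
import torch
import torch.nn.functional as F

def entropy_guided_fusion(class_tokens: torch.Tensor, logits: torch.Tensor):
    """EGF sketch: weight multiple class tokens by prediction certainty.

    class_tokens: (B, T, D) token embeddings; logits: (B, T, C)
    per-token identity predictions. Tokens whose softmax distribution
    has low entropy (confident, likely unoccluded cues) get high
    weight; high-entropy tokens are suppressed.
    """
    p = F.softmax(logits, dim=-1)
    entropy = -(p * p.clamp_min(1e-12).log()).sum(dim=-1)     # (B, T)
    weights = F.softmax(-entropy, dim=-1)                     # low H -> big w
    return (weights.unsqueeze(-1) * class_tokens).sum(dim=1)  # (B, D)

tokens, logits = torch.randn(8, 4, 768), torch.randn(8, 4, 702)
print(entropy_guided_fusion(tokens, logits).shape)  # (8, 768)
```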
8. Dai M, Zheng E, Feng Z, Qi L, Zhuang J, Yang W. Vision-Based UAV Self-Positioning in Low-Altitude Urban Environments. IEEE Trans Image Process 2024;33:493-508. [PMID: 38157460] [DOI: 10.1109/tip.2023.3346279]
Abstract
Unmanned Aerial Vehicles (UAVs) rely on satellite systems for stable positioning. However, due to limited satellite coverage or communication disruptions, UAVs may lose the signals needed for positioning. In such situations, vision-based techniques can serve as an alternative, ensuring the self-positioning capability of UAVs. However, most existing datasets are developed for the geo-localization of objects captured by UAVs rather than for UAV self-positioning. Furthermore, existing UAV datasets apply discrete sampling to synthetic data, such as Google Maps imagery, neglecting dense sampling and the uncertainties commonly encountered in practical scenarios. To address these issues, this paper presents DenseUAV, the first publicly available dataset tailored for the UAV self-positioning task. DenseUAV adopts dense sampling on UAV images obtained in low-altitude urban areas. In total, over 27K UAV- and satellite-view images of 14 university campuses are collected and annotated. In terms of methodology, we first verify the superiority of Transformers over CNNs for the proposed task. We then incorporate metric learning into representation learning to enhance the model's discriminative capacity and reduce the modality discrepancy. In addition, to facilitate joint learning from both the satellite and UAV views, we introduce a mutually supervised learning approach. Finally, we enhance the Recall@K metric and introduce a new measurement, SDM@K, to evaluate both retrieval and localization performance for the proposed task. The proposed baseline achieves a Recall@1 score of 83.01% and an SDM@1 score of 86.50% on DenseUAV. The dataset and code are publicly available at https://github.com/Dmmm1997/DenseUAV.
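For reference, the base Recall@K evaluation that SDM@K extends can be sketched as below. We do not reproduce SDM@K itself, whose exact definition is given in the paper; this sketch covers only the plain retrieval metric, with embedding sizes as placeholder assumptions.

```python
import torch
import torch.nn.functional as F

def recall_at_k(query: torch.Tensor, gallery: torch.Tensor,
                gt_index: torch.Tensor, k: int = 1) -> float:
    """Recall@K for cross-view retrieval (UAV query vs. satellite
    gallery). SDM@K additionally accounts for how spatially close the
    retrieved tiles are to the true location; see the paper.

    query: (Nq, D) UAV-view embeddings; gallery: (Ng, D) satellite-view
    embeddings; gt_index: (Nq,) index of each query's true tile.
    """
    q = F.normalize(query, dim=1)
    g = F.normalize(gallery, dim=1)
    topk = (q @ g.t()).topk(k, dim=1).indices           # (Nq, k)
    hits = (topk == gt_index.unsqueeze(1)).any(dim=1)
    return hits.float().mean().item()

q, g = torch.randn(100, 512), torch.randn(1000, 512)
gt = torch.randint(0, 1000, (100,))
print(recall_at_k(q, g, gt, k=1))
```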
9. Yu F, Jiang X, Gong Y, Zheng WS, Zheng F, Sun X. Conditional Feature Embedding by Visual Clue Correspondence Graph for Person Re-Identification. IEEE Trans Image Process 2022;31:6188-6199. [PMID: 36126030] [DOI: 10.1109/tip.2022.3206617]
Abstract
Although person re-identification (ReID) has made impressive progress, difficult cases such as occlusion, viewpoint change, and similar clothing still pose great challenges. Tackling these challenges requires extracting discriminative feature representations. Most existing methods extract ReID features from individual images separately. We propose instead that, when matching two images, the ReID features of the query image should be dynamically adjusted based on contextual information from the gallery image it is matched against; we call such features conditional feature embeddings. In this paper, we propose a novel ReID framework that extracts conditional feature embeddings based on the aligned visual clues between image pairs, called Clue Alignment based Conditional Embedding (CACE-Net). CACE-Net applies an attention module to build a detailed correspondence graph between crucial visual clues in image pairs and uses a discrepancy-based graph convolutional network (GCN) to embed the resulting correspondence information into the conditional features. Experiments show that CACE-Net achieves state-of-the-art performance on three public datasets.
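A heavily hedged sketch of the conditioning idea: the query's clue features are re-weighted by soft correspondences to the specific gallery image being matched. CACE-Net builds an explicit correspondence graph and a discrepancy-based GCN on top; the snippet shows only a plain attention-based alignment step, with the residual update and temperature as assumptions.

```python
import torch
import torch.nn.functional as F

def conditional_embedding(query_feats: torch.Tensor,
                          gallery_feats: torch.Tensor,
                          tau: float = 0.07) -> torch.Tensor:
    """Conditional-embedding sketch: align query clues to one gallery
    image and condition the query representation on the alignment.

    query_feats: (Nq, D) local clue features of the query image.
    gallery_feats: (Ng, D) local clue features of one gallery image.
    """
    # soft correspondences between query and gallery clues
    attn = F.softmax(query_feats @ gallery_feats.t() / tau, dim=1)
    aligned = attn @ gallery_feats        # gallery clues per query clue
    # residual conditioning of the query clues on their counterparts
    return F.normalize(query_feats + aligned, dim=1)

q, g = torch.randn(6, 256), torch.randn(6, 256)
print(conditional_embedding(q, g).shape)  # (6, 256)
```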
10. Li K, Wang X, Liu Y, Zhang B, Zhang M. Cross-modality disentanglement and shared feedback learning for infrared-visible person re-identification. Knowl Based Syst 2022. [DOI: 10.1016/j.knosys.2022.109337]
11. Occluded person re-identification based on differential attention siamese network. Appl Intell 2022. [DOI: 10.1007/s10489-021-02820-6]
12. Yu Z, Huang Z, Qin W, Guan T, Zhong Y, Sun D. Joint uneven channel information network with blend metric loss for person re-identification. Complex Intell Syst 2022. [DOI: 10.1007/s40747-022-00709-6]
Abstract
Person re-identification, one of the most challenging tasks in computer vision, aims to recognize the same person across different cameras. Local feature information has been proven to improve performance efficiently. Horizontal even division of images and pose estimation are two popular methods for extracting local features; however, the former may cause misalignment, while the latter requires heavy computation. To fill this gap and improve performance, an efficient strategy is proposed in this work. First, a joint uneven channel information network is designed, consisting of an uneven channel information extraction network and a channel information fusion network. Unlike traditional image division, the former divides images horizontally and unevenly with strong alignment based on weak pose estimation and extracts multiple channels of information. The latter fuses channel information based on channel validity and generates an efficient similarity descriptor. To optimize the joint uneven channel information network efficiently, this work proposes a blend metric loss. Extra image information is used to dynamically adjust the penalty on the sample distance and the distance margin based on the outlier degree of the hardest sample, forming the i-TriHard loss. In addition, softmax loss and center loss are embedded in the blend metric loss, guiding the network to learn more discriminative features. Our method achieves 89.6% mAP and 95.9% Rank-1 on Market-1501, and 79.9% mAP and 89.4% Rank-1 on DukeMTMC. The proposed method also performs excellently on occluded datasets.
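For context, here is the standard batch-hard triplet (TriHard) loss that i-TriHard builds on. The paper's dynamic adjustment of penalty and margin from the hardest sample's outlier degree is deliberately omitted; treat this as the baseline formulation only, with the margin value as a common default rather than the paper's setting.

```python
import torch
import torch.nn.functional as F

def trihard_loss(feats: torch.Tensor, pids: torch.Tensor,
                 margin: float = 0.3) -> torch.Tensor:
    """Batch-hard triplet (TriHard) loss sketch.

    For each anchor, take its hardest (farthest) positive and hardest
    (closest) negative within the batch and apply a hinge with margin.
    feats: (B, D) embeddings; pids: (B,) identity labels.
    """
    dist = torch.cdist(feats, feats)                 # (B, B) pairwise L2
    same = pids.unsqueeze(0) == pids.unsqueeze(1)
    hardest_pos = (dist * same.float()).max(dim=1).values
    neg_dist = dist.masked_fill(same, float('inf'))
    hardest_neg = neg_dist.min(dim=1).values
    return F.relu(hardest_pos - hardest_neg + margin).mean()

f = F.normalize(torch.randn(16, 128), dim=1)
ids = torch.arange(4).repeat_interleave(4)   # P=4 identities, K=4 images
print(trihard_loss(f, ids))
```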
13. Sun Y, Ye Y, Li X, Feng S, Zhang B, Kang J, Dai K. Unsupervised deep hashing through learning soft pseudo label for remote sensing image retrieval. Knowl Based Syst 2022. [DOI: 10.1016/j.knosys.2021.107807]
14. Chen Y, Xia S, Zhao J, Zhou Y, Niu Q, Yao R, Zhu D, Liu D. ResT-ReID: Transformer Block-based Residual Learning for Person Re-identification. Pattern Recognit Lett 2022. [DOI: 10.1016/j.patrec.2022.03.020]
15. Yu Z, Qin W, Huang Z, Tahsin L, Sun D, Zhong Y. Joining features by global guidance with bi-relevance trihard loss for person re-identification. Neural Comput Appl 2022. [DOI: 10.1007/s00521-021-06852-4]
16. Yu Z, Qin W, Tahsin L, Huang Z. TriEP: Expansion-Pool TriHard Loss for Person Re-Identification. Neural Process Lett 2022. [DOI: 10.1007/s11063-021-10736-y]
17. Multi-Level Fusion Temporal-Spatial Co-Attention for Video-Based Person Re-Identification. Entropy 2021;23:e23121686. [PMID: 34945992] [PMCID: PMC8700156] [DOI: 10.3390/e23121686]
Abstract
A convolutional neural network can easily fall into local minima when data are insufficient, and the resulting training is unstable. Many current methods address these problems by adding pedestrian attributes, pedestrian poses, and other auxiliary information, but such information requires additional collection, which is time-consuming and laborious. Moreover, each frame of a video sequence has a different degree of similarity. In this paper, multi-level fusion temporal-spatial co-attention is adopted to improve video-based person re-identification (reID). For small datasets, the improved network better prevents over-fitting and reduces dataset limitations. Specifically, the concept of knowledge evolution is introduced into video-based person re-identification to improve the residual backbone network (ResNet). A global branch, a local branch, and an attention branch are used in parallel for feature extraction, and the three resulting high-level features are embedded in the metric learning network to improve the network's generalization ability and the accuracy of video-based person re-identification. Experiments on the small datasets PRID2011 and iLIDS-VID confirm that the improved network better prevents over-fitting, while experiments on MARS and DukeMTMC-VideoReID show that the proposed method extracts more feature information and improves generalization. The results show that our method achieves better performance, reaching 90.15% Rank-1 and 81.91% mAP on MARS.
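The three parallel branches on a shared backbone can be sketched compactly. Pooling choices, embedding size, and part count below are illustrative assumptions, and the knowledge-evolution training of the backbone is omitted; this is a structural sketch, not the paper's implementation.

```python
import torch
import torch.nn as nn

class ThreeBranchHead(nn.Module):
    """Parallel global / local / attention branches on a shared
    ResNet feature map, producing three high-level features for
    metric learning."""

    def __init__(self, in_dim: int = 2048, emb: int = 512, parts: int = 4):
        super().__init__()
        self.global_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(in_dim, emb))
        # local branch: pool horizontal stripes, then embed the parts
        self.part_pool = nn.AdaptiveAvgPool2d((parts, 1))
        self.local_head = nn.Linear(in_dim * parts, emb)
        # attention branch: spatial saliency re-weighting before pooling
        self.attn = nn.Conv2d(in_dim, 1, kernel_size=1)
        self.attn_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(in_dim, emb))

    def forward(self, fmap: torch.Tensor):
        g = self.global_head(fmap)
        l = self.local_head(self.part_pool(fmap).flatten(1))
        a = self.attn_head(fmap * torch.sigmoid(self.attn(fmap)))
        return g, l, a

head = ThreeBranchHead()
g, l, a = head(torch.randn(2, 2048, 16, 8))
print(g.shape, l.shape, a.shape)  # each (2, 512)
```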
18. Zhang Q, Lai J, Feng Z, Xie X. Seeing Like a Human: Asynchronous Learning With Dynamic Progressive Refinement for Person Re-Identification. IEEE Trans Image Process 2021;31:352-365. [PMID: 34807829] [DOI: 10.1109/tip.2021.3128330]
Abstract
Learning discriminative and rich features is an important research task for person re-identification. Previous studies have attempted to capture global and local features at the same time and at the same layer of the model in a non-interactive manner, which is called synchronous learning. However, synchronous learning leads to highly similar features and, in turn, degrades model performance. To this end, we propose asynchronous learning based on the human visual perception mechanism. Asynchronous learning emphasizes the temporal and spatial asynchrony of feature learning and achieves mutual promotion and cyclical interaction between global and local feature learning. Furthermore, we design a dynamic progressive refinement module to improve local features under the guidance of global features. Its dynamic property allows the module to adaptively adjust network parameters according to the input image in both the training and testing stages, while its progressive property narrows the semantic gap between the global and local features owing to that guidance. Finally, we conducted experiments on four datasets: Market1501, CUHK03, DukeMTMC-ReID, and MSMT17. The experimental results show that asynchronous learning effectively improves feature discrimination and achieves strong performance.
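The input-dependent (dynamic) part of the refinement admits a small sketch: the global feature is turned into per-channel gates applied to each local part feature, so the modulation changes with the input image at both train and test time. The paper's actual module is richer; the gate design below is an assumption capturing only that dynamic-gating idea.

```python
import torch
import torch.nn as nn

class GlobalGuidedRefinement(nn.Module):
    """Sketch of refining local features under global-feature guidance,
    in the spirit of the dynamic progressive refinement module."""

    def __init__(self, dim: int = 512):
        super().__init__()
        # global vector -> per-channel gates in (0, 1)
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, global_feat: torch.Tensor, local_feats: torch.Tensor):
        # global_feat: (B, D); local_feats: (B, P, D) part features
        g = self.gate(global_feat).unsqueeze(1)   # (B, 1, D) dynamic gates
        return local_feats * g                    # refined local features

refine = GlobalGuidedRefinement()
out = refine(torch.randn(8, 512), torch.randn(8, 4, 512))
print(out.shape)  # (8, 4, 512)
```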