1
Lu Z, Lin R, Hu H. Disentangling Modality and Posture Factors: Memory-Attention and Orthogonal Decomposition for Visible-Infrared Person Re-Identification. IEEE Transactions on Neural Networks and Learning Systems 2025; 36:5494-5508. [PMID: 38619964] [DOI: 10.1109/tnnls.2024.3384023]
Abstract
Striving to match person identities between visible (VIS) and near-infrared (NIR) images, VIS-NIR re-identification (Re-ID) has attracted increasing attention due to its wide applications in low-light scenes. However, owing to the modality and pose discrepancies exhibited in heterogeneous images, the extracted representations inevitably contain various modality and posture factors, which impair cross-modality identity matching. To solve this problem, we propose a disentangling modality and posture factors (DMPF) model, which disentangles the modality and posture factors by fusing feature-memory and pedestrian-skeleton information. Specifically, DMPF comprises three modules: a three-stream feature extraction network (TFENet), modality factor disentanglement (MFD), and posture factor disentanglement (PFD). First, to provide memory and skeleton information for disentangling the modality and posture factors, TFENet is designed as a three-stream network that extracts VIS-NIR image features and skeleton features. Second, to eliminate the modality discrepancy across batches, we maintain memory queues of previous batch features through a momentum updating mechanism and propose MFD to integrate features over the whole training set via memory-attention layers. These layers explore intramodality and intermodality relationships between features from the current batch and the memory queues under the optimization of the optimal transport (OT) method, which encourages heterogeneous features with the same identity to exhibit higher similarity. Third, to decouple the posture factors from the representations, we introduce the PFD module to learn posture-unrelated features with the assistance of the skeleton features. In addition, we perform subspace orthogonal decomposition on both the image and skeleton features to separate posture-related from identity-related information. The posture-related features are used to disentangle the posture factors from the representations via a designed posture-feature consistency (PfC) loss, while the identity-related features are concatenated to obtain more discriminative identity representations. The effectiveness of DMPF is validated through comprehensive experiments on two VIS-NIR pedestrian Re-ID datasets.
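To make the memory mechanism above concrete, here is a minimal PyTorch-style sketch of a momentum-updated feature memory with a simple memory-attention readout, one instance per modality. This is not the authors' code: the queue size, momentum value, temperature, and normalization choices are all assumptions.

```python
# Hedged sketch of a momentum-updated feature memory in the spirit of DMPF's
# MFD module; hyperparameters and structure are illustrative assumptions.
import torch
import torch.nn.functional as F

class ModalityMemory:
    def __init__(self, feat_dim: int, queue_size: int = 4096, momentum: float = 0.9):
        self.momentum = momentum
        self.queue = F.normalize(torch.randn(queue_size, feat_dim), dim=1)
        self.ptr = 0

    @torch.no_grad()
    def enqueue(self, feats: torch.Tensor) -> None:
        """Blend the current batch into the queue in FIFO order, with momentum
        so the memory varies slowly across batches (assumes batch <= queue)."""
        feats = F.normalize(feats, dim=1)
        n = feats.size(0)
        idx = torch.arange(self.ptr, self.ptr + n) % self.queue.size(0)
        self.queue[idx] = F.normalize(
            self.momentum * self.queue[idx] + (1.0 - self.momentum) * feats, dim=1)
        self.ptr = (self.ptr + n) % self.queue.size(0)

    def attend(self, queries: torch.Tensor) -> torch.Tensor:
        """Memory-attention: re-express each query as attention over the memory."""
        attn = torch.softmax(F.normalize(queries, dim=1) @ self.queue.t() / 0.07, dim=1)
        return attn @ self.queue

vis_memory = ModalityMemory(feat_dim=256)
vis_memory.enqueue(torch.randn(32, 256))      # features from the current batch
refined = vis_memory.attend(torch.randn(8, 256))
```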
2
Yan S, Tang H, Zhang L, Tang J. Image-Specific Information Suppression and Implicit Local Alignment for Text-Based Person Search. IEEE Transactions on Neural Networks and Learning Systems 2024; 35:17973-17986. [PMID: 37713222] [DOI: 10.1109/tnnls.2023.3310118]
Abstract
Text-based person search (TBPS) is a challenging task that aims to retrieve pedestrian images with the same identity from an image gallery given a text query. In recent years, TBPS has made remarkable progress, and state-of-the-art (SOTA) methods achieve superior performance by learning local fine-grained correspondence between images and texts. However, most existing methods rely on explicitly generated local parts to model fine-grained correspondence between modalities, which is unreliable due to the lack of contextual information or the potential introduction of noise. Moreover, existing methods seldom consider the information-inequality problem between modalities caused by image-specific information. To address these limitations, we propose an efficient joint multilevel alignment network (MANet) for TBPS, which learns aligned image/text feature representations between modalities at multiple levels and realizes fast and effective person search. Specifically, we first design an image-specific information suppression (ISS) module, which suppresses image background and environmental factors through relation-guided localization (RGL) and channel attention filtration (CAF), respectively. This module effectively alleviates the information-inequality problem and aligns the information volume between images and texts. Second, we propose an implicit local alignment (ILA) module to adaptively aggregate all pixel/word features of an image/text onto a set of modality-shared semantic topic centers, implicitly learning local fine-grained correspondence between modalities without additional supervision or cross-modal interactions. A global alignment (GA) module is also introduced as a supplement to the local perspective. The cooperation of the global and local alignment modules enables better semantic alignment between modalities. Extensive experiments on multiple databases demonstrate the effectiveness and superiority of our MANet.
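The ILA aggregation step admits a short sketch: learnable topic centers shared by both branches, with soft assignment of tokens to centers. The center count, temperature, and pooling below are assumptions, not the paper's exact design.

```python
# Hedged sketch of implicit local alignment via shared topic centers (ILA).
import torch
import torch.nn as nn

class TopicAggregation(nn.Module):
    def __init__(self, dim: int = 512, num_topics: int = 8, tau: float = 0.1):
        super().__init__()
        # Centers are shared between the image and text branches, so the k-th
        # local feature of each modality describes the same semantic "topic".
        self.centers = nn.Parameter(torch.randn(num_topics, dim))
        self.tau = tau

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, dim) pixel or word features.
        sim = tokens @ self.centers.t()                  # (B, N, K)
        assign = torch.softmax(sim / self.tau, dim=2)    # soft topic assignment
        pooled = assign.transpose(1, 2) @ tokens         # (B, K, dim)
        # Normalize by the total assignment mass per topic.
        return pooled / (assign.sum(dim=1, keepdim=True).transpose(1, 2) + 1e-6)

ila = TopicAggregation()
img_local = ila(torch.randn(4, 196, 512))   # 196 patches -> 8 local features
txt_local = ila(torch.randn(4, 64, 512))    # 64 words   -> 8 local features
```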
3
Yan S, Dong N, Zhang L, Tang J. CLIP-Driven Fine-Grained Text-Image Person Re-Identification. IEEE Transactions on Image Processing 2023; 32:6032-6046. [PMID: 37910422] [DOI: 10.1109/tip.2023.3327924]
Abstract
Text-Image Person Re-identification (TIReID) aims to retrieve the image corresponding to a given text query from a pool of candidate images. Existing methods employ prior knowledge from single-modality pre-training to facilitate learning but lack multi-modal correspondence information. Vision-Language Pre-training, such as CLIP (Contrastive Language-Image Pre-training), can address this limitation. However, CLIP falls short in capturing fine-grained information and therefore cannot fully exert its powerful capacity in TIReID. Moreover, the popular explicit local-matching paradigm for mining fine-grained information relies heavily on the quality of local parts and on cross-modal inter-part interaction and guidance, leading to intra-modal information distortion and ambiguity. Accordingly, in this paper, we propose a CLIP-driven Fine-grained information excavation framework (CFine) to fully exploit the knowledge of CLIP for TIReID. To transfer multi-modal knowledge effectively, we conduct fine-grained information excavation to mine modality-shared discriminative details for global alignment. Specifically, we propose a multi-level global feature learning (MGF) module that fully mines the discriminative local information within each modality, emphasizing identity-related clues through enhanced interaction between global image (text) features and informative local patches (words). MGF generates a set of enhanced global features for later inference. Furthermore, we design cross-grained feature refinement (CFR) and fine-grained correspondence discovery (FCD) modules to establish cross-modal correspondence at both coarse and fine granularities (image-word, sentence-patch, word-patch), ensuring the reliability of the informative local patches/words. CFR and FCD are removed during inference to optimize computational efficiency. Extensive experiments on multiple benchmarks demonstrate the superior performance of our method in TIReID.
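The cross-grained filtering idea can be illustrated briefly: keep only the image patches most similar to the global sentence feature before any word-patch matching. The top-k value and the cosine-similarity choice below are assumptions.

```python
# Hedged sketch of cross-grained patch selection in the spirit of CFine's CFR.
import torch
import torch.nn.functional as F

def select_informative_patches(patches: torch.Tensor,
                               sentence: torch.Tensor,
                               top_k: int = 16) -> torch.Tensor:
    """patches: (N, d) patch tokens; sentence: (d,) global text feature.
    Returns the top_k patches most relevant to the sentence."""
    sim = F.normalize(patches, dim=1) @ F.normalize(sentence, dim=0)  # (N,)
    idx = sim.topk(top_k).indices
    return patches[idx]                                              # (top_k, d)

kept = select_informative_patches(torch.randn(196, 512), torch.randn(512))
```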
4
Jiang J, Xiao J, Wang R, Li T, Zhang W, Ran R, Xiang S. Graph Sampling-Based Multi-Stream Enhancement Network for Visible-Infrared Person Re-Identification. Sensors (Basel, Switzerland) 2023; 23:7948. [PMID: 37766005] [PMCID: PMC10534846] [DOI: 10.3390/s23187948]
Abstract
With the increasing demand for person re-identification (Re-ID), all-day retrieval has become an inevitable requirement. Single-modal Re-ID is no longer sufficient to meet it, making multi-modal data crucial. Consequently, the Visible-Infrared Person Re-Identification (VI Re-ID) task has been proposed, which aims to match pairs of person images from the visible and infrared modalities. The significant discrepancy between the two modalities poses a major challenge. Existing VI Re-ID methods focus on cross-modal feature learning and modal transformation to alleviate this discrepancy but overlook person contour information. Contours are modality-invariant, which is vital for learning effective identity representations and for cross-modal matching. In addition, due to the low intra-modal diversity of the visible modality, it is difficult to distinguish the boundaries between some hard samples. To address these issues, we propose the Graph Sampling-based Multi-stream Enhancement Network (GSMEN). First, the Contour Expansion Module (CEM) incorporates a person's contour information into the original samples, further reducing the modality discrepancy and improving matching stability between image pairs of different modalities. Additionally, to better distinguish cross-modal hard sample pairs during training, an innovative Cross-modality Graph Sampler (CGS) is designed for sample selection before training. The CGS computes feature distances between samples from different modalities and groups similar samples into the same batch, effectively exploring boundary relationships between hard classes in the cross-modal setting. Experiments conducted on the SYSU-MM01 and RegDB datasets demonstrate the superiority of the proposed method. Specifically, on the RegDB dataset in the VIS→IR task, our method achieves 93.69% Rank-1 accuracy and 92.56% mAP.
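The sampler's core step reduces to a nearest-neighbor grouping over identity centroids, sketched below. Using centroids, Euclidean distance, and a fixed neighbor count are all assumptions rather than the paper's exact procedure.

```python
# Hedged sketch of a cross-modality graph sampler in the spirit of GSMEN's CGS:
# co-batch each visible identity with its nearest infrared identities so that
# hard cross-modal pairs appear together during training.
import torch

def build_batches(vis_centroids: torch.Tensor,
                  ir_centroids: torch.Tensor,
                  neighbours: int = 3):
    """Centroids: (num_ids, d) mean feature per identity and modality.
    Returns, for each visible identity, the IR identities to sample with it."""
    dist = torch.cdist(vis_centroids, ir_centroids)         # (num_ids, num_ids)
    nearest = dist.topk(neighbours, largest=False).indices  # hardest IR neighbours
    return [(anchor, nearest[anchor].tolist())
            for anchor in range(vis_centroids.size(0))]

batches = build_batches(torch.randn(50, 256), torch.randn(50, 256))
```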
Affiliation(s)
- Jinhua Jiang, College of Computer and Information Science, Chongqing Normal University, Chongqing 401331, China
- Junjie Xiao, College of Computer and Information Science, Chongqing Normal University, Chongqing 401331, China
- Renlin Wang, School of Computer Engineering, Weifang University, Weifang 261061, China
- Tiansong Li, College of Computer and Information Science, Chongqing Normal University, Chongqing 401331, China
- Wenfeng Zhang, College of Computer and Information Science, Chongqing Normal University, Chongqing 401331, China
- Ruisheng Ran, College of Computer and Information Science, Chongqing Normal University, Chongqing 401331, China
- Sen Xiang, School of Information Science and Engineering, Wuhan University of Science and Technology, Wuhan 430081, China
5
Zhou J, Dong Q, Zhang Z, Liu S, Durrani TS. Cross-Modality Person Re-Identification via Local Paired Graph Attention Network. Sensors (Basel, Switzerland) 2023; 23:4011. [PMID: 37112352] [PMCID: PMC10146823] [DOI: 10.3390/s23084011]
Abstract
Cross-modality person re-identification (ReID) aims to retrieve a pedestrian image of the RGB modality from a set of infrared (IR) pedestrian images, and vice versa. Recently, some approaches have constructed a graph to learn the relevance of pedestrian images of distinct modalities and thus narrow the gap between the IR and RGB modalities, but they overlook the correlation between IR and RGB image pairs. In this paper, we propose a novel graph model called the Local Paired Graph Attention Network (LPGAT). It uses the paired local features of pedestrian images from different modalities to build the nodes of the graph. For accurate propagation of information among the nodes, we propose a contextual attention coefficient that leverages distance information to regulate the node-updating process. Furthermore, we put forward Cross-Center Contrastive Learning (C3L) to constrain the distance between local features and their heterogeneous centers, which is beneficial for learning a complete distance metric. We conduct experiments on the RegDB and SYSU-MM01 datasets to validate the feasibility of the proposed approach.
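The C3L objective can be sketched compactly: each feature is pulled toward the center of its identity computed from the opposite modality and pushed from the other centers. The InfoNCE form and temperature below are assumptions.

```python
# Hedged sketch of cross-center contrastive learning (C3L) as described above.
import torch
import torch.nn.functional as F

def c3l_loss(feats: torch.Tensor, labels: torch.Tensor,
             hetero_centers: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """feats: (B, d) features of one modality; labels: (B,) identity indices;
    hetero_centers: (num_ids, d), one center per identity computed from the
    *other* modality. Cross-entropy pulls each feature to its own center."""
    logits = F.normalize(feats, dim=1) @ F.normalize(hetero_centers, dim=1).t()
    return F.cross_entropy(logits / tau, labels)

loss = c3l_loss(torch.randn(32, 256), torch.randint(0, 10, (32,)),
                torch.randn(10, 256))
```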
Affiliation(s)
- Jianglin Zhou, Tianjin Key Laboratory of Wireless Mobile Communications and Power Transmission, Tianjin Normal University, Tianjin 300387, China
- Qing Dong, Tianjin Key Laboratory of Wireless Mobile Communications and Power Transmission, Tianjin Normal University, Tianjin 300387, China
- Zhong Zhang, Tianjin Key Laboratory of Wireless Mobile Communications and Power Transmission, Tianjin Normal University, Tianjin 300387, China
- Shuang Liu, Tianjin Key Laboratory of Wireless Mobile Communications and Power Transmission, Tianjin Normal University, Tianjin 300387, China
- Tariq S. Durrani, Department of Electronic and Electrical Engineering, University of Strathclyde, Glasgow G1 1QE, UK
6
Uddin MK, Bhuiyan A, Bappee FK, Islam MM, Hasan M. Person Re-Identification with RGB-D and RGB-IR Sensors: A Comprehensive Survey. Sensors (Basel, Switzerland) 2023; 23:1504. [PMID: 36772548] [PMCID: PMC9919319] [DOI: 10.3390/s23031504]
Abstract
Learning appearance embeddings is of great importance for a variety of computer-vision applications, which has prompted a surge in person re-identification (Re-ID) papers. The aim of these papers is to identify an individual across a set of non-overlapping cameras. Despite recent advances in RGB-RGB Re-ID with deep-learning architectures, such approaches fail to work consistently well at low resolution and in dark conditions. The introduction of different sensors (i.e., RGB-D and infrared (IR)) enables the capture of appearance even in dark conditions, and much recent research has been dedicated to finding appearance embeddings in such conditions using advanced camera sensors. In this paper, we give a comprehensive overview of existing Re-ID approaches that utilize the additional information from different sensor-based methods to address the constraints of RGB camera-based person Re-ID systems. Although a number of survey papers consider either the RGB-RGB or the Visible-IR scenario, none consider both RGB-D and RGB-IR. We present a detailed taxonomy of the existing approaches along with the existing RGB-D and RGB-IR person Re-ID datasets. We then summarize the performance of state-of-the-art methods on several representative RGB-D and RGB-IR datasets. Finally, future directions and current issues are discussed for improving sensor-based person Re-ID systems.
Affiliation(s)
- Md Kamal Uddin, Interactive Systems Lab, Graduate School of Science and Engineering, Saitama University, Saitama 338-8570, Japan; Department of Computer Science and Telecommunication Engineering, Noakhali Science and Technology University, Noakhali 3814, Bangladesh
- Amran Bhuiyan, Information Retrieval and Knowledge Management Research Laboratory, York University, Toronto, ON M3J 1P3, Canada
- Fateha Khanam Bappee, Department of Computer Science and Telecommunication Engineering, Noakhali Science and Technology University, Noakhali 3814, Bangladesh
- Md Matiqul Islam, Interactive Systems Lab, Graduate School of Science and Engineering, Saitama University, Saitama 338-8570, Japan; Department of Information and Communication Engineering, University of Rajshahi, Rajshahi 6205, Bangladesh
- Mahmudul Hasan, Interactive Systems Lab, Graduate School of Science and Engineering, Saitama University, Saitama 338-8570, Japan; Department of Computer Science and Engineering, Comilla University, Kotbari 3506, Bangladesh
7
Zhao Q, Wu H, Zhu J. Margin-Based Modal Adaptive Learning for Visible-Infrared Person Re-Identification. Sensors (Basel, Switzerland) 2023; 23:1426. [PMID: 36772466] [PMCID: PMC9921303] [DOI: 10.3390/s23031426]
Abstract
Visible-infrared person re-identification (VIPR) has great potential for the intelligent transportation systems of smart cities, but it is challenging due to the huge modal discrepancy between visible and infrared images. Although visible and infrared data can be regarded as two domains, VIPR is not identical to domain adaptation, which would massively eliminate modal discrepancies: because VIPR has complete identity information in both the visible and infrared modalities, overemphasizing domain adaptation drains the discriminative appearance information of each domain. We therefore propose a novel margin-based modal adaptive learning (MMAL) method for VIPR. On each domain, we apply triplet and label-smoothing cross-entropy loss functions to learn appearance-discriminative features. Between the two domains, we design a simple yet effective marginal maximum mean discrepancy (M3D) loss function that avoids excessive suppression of modal discrepancies and thus protects the discriminative ability of the features on each domain. As a result, our MMAL method learns modal-invariant yet appearance-discriminative features that improve VIPR. Experimental results show that MMAL achieves state-of-the-art VIPR performance; for example, on the RegDB dataset in the visible-to-infrared retrieval mode, the Rank-1 accuracy is 93.24% and the mean average precision is 83.77%.
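A margin-clipped MMD of the kind described can be sketched in a few lines. The Gaussian kernel, its bandwidth, and the margin value below are assumptions, not the paper's exact M3D formulation.

```python
# Hedged sketch of a margin-based MMD in the spirit of MMAL's M3D loss:
# penalize the visible/infrared distribution gap only above a margin, so the
# modal discrepancy is reduced but not driven all the way to zero.
import torch

def gaussian_mmd(x: torch.Tensor, y: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    def k(a, b):
        # Mean Gaussian kernel value between all pairs of rows.
        return torch.exp(-torch.cdist(a, b).pow(2) / (2 * sigma ** 2)).mean()
    return k(x, x) + k(y, y) - 2 * k(x, y)

def m3d_loss(vis: torch.Tensor, ir: torch.Tensor, margin: float = 0.1) -> torch.Tensor:
    # Gradient vanishes once the discrepancy is already below the margin.
    return torch.clamp(gaussian_mmd(vis, ir) - margin, min=0.0)

loss = m3d_loss(torch.randn(32, 256), torch.randn(32, 256))
```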
Affiliation(s)
- Qianqian Zhao, College of Information Science and Engineering, Huaqiao University, Xiamen 361021, China
- Hanxiao Wu, College of Information Science and Engineering, Huaqiao University, Xiamen 361021, China
- Jianqing Zhu (corresponding author), College of Engineering, Huaqiao University, Quanzhou 362021, China; Xiamen Yealink Network Technology Company Limited, Xiamen 361015, China
8
Robust non-negative supervised low-rank discriminant embedding (NSLRDE) for feature extraction. International Journal of Machine Learning and Cybernetics 2023. [DOI: 10.1007/s13042-022-01752-y]
9
Zhang Z, Wu J, Chen Y, Wang J, Xu J. Distinguish between Stochastic and Chaotic Signals by a Local Structure-Based Entropy. Entropy (Basel, Switzerland) 2022; 24:1752. [PMID: 36554157] [PMCID: PMC9778404] [DOI: 10.3390/e24121752]
Abstract
As a measure of complexity, information entropy is frequently used to categorize time series, for example in machinery fault diagnosis and biological signal identification, and is regarded as a characteristic of dynamic systems. Many entropies, however, are ineffective in multivariate scenarios due to correlations. In this paper, we propose a local structure entropy (LSE) based on the idea of a recurrence network. Given a certain tolerance and scale, LSE values can distinguish multivariate chaotic sequences from stochastic signals. Three financial market indices are used to evaluate the proposed LSE. The results show that the LSE values of the FTSE 100 and S&P 500 indices are higher than that of the SZI, which indicates that the European and American stock markets are more sophisticated than the Chinese stock market. Additionally, using decision trees as classifiers, LSE is employed to detect bearing faults, achieving higher recognition accuracy than permutation entropy.
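One minimal reading of a recurrence-network entropy is sketched below: embed the series, link states closer than a tolerance, and take the Shannon entropy of the resulting degree distribution. The paper's exact LSE definition (scales, multivariate handling, local structure measure) may well differ; this is purely illustrative.

```python
# Hedged sketch of a recurrence-network entropy in the spirit of LSE.
import numpy as np

def local_structure_entropy(x: np.ndarray, dim: int = 3, tol: float = 0.2) -> float:
    # Delay embedding with lag 1: rows are reconstructed state vectors.
    states = np.stack([x[i:len(x) - dim + i + 1] for i in range(dim)], axis=1)
    dists = np.linalg.norm(states[:, None, :] - states[None, :, :], axis=2)
    # Recurrence network: link states within a tolerance of the signal scale.
    adj = (dists < tol * x.std()) & ~np.eye(len(states), dtype=bool)
    degrees = adj.sum(axis=1)
    p = np.bincount(degrees) / len(degrees)   # empirical degree distribution
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())      # Shannon entropy of the structure

rng = np.random.default_rng(0)
print(local_structure_entropy(rng.standard_normal(500)))  # a stochastic signal
```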
Affiliation(s)
- Zelin Zhang, School of Mathematics, Physics and Optoelectronic Engineering, Hubei University of Automotive Technology, Shiyan 442002, China; Hubei Key Laboratory of Applied Mathematics, Hubei University, Wuhan 430061, China
- Jun Wu, School of Mathematics, Physics and Optoelectronic Engineering, Hubei University of Automotive Technology, Shiyan 442002, China; Hubei Key Laboratory of Applied Mathematics, Hubei University, Wuhan 430061, China
- Yufeng Chen, School of Electrical and Information Engineering, Hubei University of Automotive Technology, Shiyan 442002, China
- Ji Wang, School of Liberal Arts and Humanities, Sichuan Vocational College of Finance and Economics, Chengdu 610101, China
- Jinyu Xu, School of Electrical and Information Engineering, Hubei University of Automotive Technology, Shiyan 442002, China
10
Abstract
Feature extraction is an important part of perceptual hashing, and how to compress the robust features of images into hash codes has become a hot research topic. Converting a two-dimensional image into a one-dimensional descriptor incurs a higher computational cost and is not optimal. To preserve the internal feature structure of the original two-dimensional image, a new Bilinear Supervised Neighborhood Discrete Discriminant Hashing (BNDDH) algorithm is proposed in this paper. Firstly, the algorithm constructs two new neighborhood graphs to maintain the geometric relationships between samples, and it reduces the quantization loss by directly constraining the hash codes. Secondly, two small rotation matrices are used to realize the bilinear projection of the two-dimensional descriptor. Finally, experiments verify the performance of BNDDH on different feature types, such as raw image pixels and a Convolutional Neural Network (CNN)-based AlexConv5 feature. The experimental results and discussion clearly show that the proposed BNDDH algorithm outperforms existing traditional hashing algorithms and represents images more efficiently.
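The bilinear projection step admits a compact sketch: two small orthogonal matrices project the 2-D descriptor from both sides, and the sign gives the hash code, so the image's 2-D structure is never flattened into a long vector. In BNDDH the rotations are learned to minimize quantization loss; random orthogonal matrices stand in for them here, and all sizes are assumptions.

```python
# Hedged sketch of sign-quantized bilinear projection in the spirit of BNDDH.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 64))           # a 2-D image descriptor

def random_rotation(n: int) -> np.ndarray:
    # QR factorization of a Gaussian matrix yields an orthogonal matrix;
    # in BNDDH these would be learned, not random.
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return q

R1 = random_rotation(64)[:, :8]             # left projection  (64 -> 8)
R2 = random_rotation(64)[:, :8]             # right projection (64 -> 8)
code = np.sign(R1.T @ X @ R2)               # (8, 8) sign matrix = 64-bit code
print(code.reshape(-1))
```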